Assessing the Performance of Penalized Regression Methods and the Classical Least Squares Method

CHAPTER ONE

Aim and Objectives of the Study

The main aim of this research is to assess the performance and advantages of the LASSO, Elastic Net and CAEN methods relative to the classical least squares regression method. We hope to achieve this aim through the following objectives:

  1. Applying the penalized regression methods to eliminate multicollinearity among the predictors,
  2. Identifying the variables that exhibit multicollinearity using the Variance Inflation Factor (a computational sketch follows this list), and
  3. Identifying the number of variables selected by each of the penalized regression methods and by the classical least squares method.
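
To make objective 2 concrete, the sketch below computes the Variance Inflation Factor for each predictor. It is a minimal illustration assuming the usual definition VIF_j = 1/(1 - R_j^2), where R_j^2 is the coefficient of determination from regressing predictor j on the remaining predictors; the threshold of 10 often quoted as a sign of serious multicollinearity is a rule of thumb, not part of the definition.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the predictor matrix X (n x p)."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y_j = X[:, j]
        # Regress column j on all remaining columns (with an intercept).
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y_j, rcond=None)
        resid = y_j - A @ beta
        r2 = 1.0 - resid @ resid / np.sum((y_j - y_j.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)  # large VIF (e.g. > 10) flags multicollinearity
    return vifs
```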

CHAPTER TWO

LITERATURE REVIEW

 Introduction

This chapter reviews related literature on the work of different scholars regarding classical regression, penalized regression, and their applications. Many authors have proposed penalized regression methods to overcome the shortcomings of ordinary least squares with regard to prediction accuracy.

Classical Regression Analysis

Regression analysis is a statistical technique used to relate variables. Its basic aim is to build a mathematical model relating a dependent variable to one or more independent variables. The method of least squares was first discovered around 1805 (Stigler, 1986). There was disagreement about who first discovered it: it appears that the method was discovered independently by Carl Friedrich Gauss (1777-1855) and Adrien-Marie Legendre (1752-1833), that Gauss started using it before 1803 (he claimed about 1795, but there is no corroboration of this earlier date), and that the first published account was by Legendre in 1805, as indicated in Draper and Smith (1981). Stigler (1986) noted that Sir Francis Galton discovered regression around 1885 in his studies of heredity. Any contemporary course in regression analysis today starts with the method of least squares and its variations.

Multiple Linear Regression (MLR) is one of the most commonly used data mining techniques and can provide useful insight in cases where the rigid assumptions associated with MLR are met. MLR is a very versatile tool and can be applied to almost any process, system, or area of study. Much has been published on this subject; we refer the interested reader to Kutner et al. (2004) and Myers (1990), which provide thorough accounts of MLR and will be indispensable for most readers.

A key step in developing an appropriate MLR model is selecting a method of model building and a set of best-model criteria. Efroymson (1960) introduced stepwise regression, which is commonly used for model building. Stepwise regression was intended to be an automated procedure that selects the most statistically significant variables from a finite pool of independent variables. There are three separate stepwise regression procedures: forward selection, backward selection, and mixed selection. Mixed selection is the most statistically defensible type of stepwise regression and is a mixture of the forward and backward procedures, as indicated in Kutner et al. (2004), Neter et al. (1996), and Draper and Smith (1981).
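
As a hedged illustration of the forward variant of this procedure, the sketch below uses scikit-learn's SequentialFeatureSelector on the diabetes data. Note that it ranks candidate variables by cross-validated fit rather than by the F-tests or p-values of classical stepwise routines, so it approximates rather than reproduces the textbook algorithm.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Forward selection: start from the empty model and greedily add the variable
# that most improves cross-validated fit, stopping at 5 selected variables.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
)
selector.fit(X, y)
print("Selected columns:", selector.get_support(indices=True))
```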

As noted by Kutner et al. (2004), model validation is the final step in the regression model-building process. Furthermore, it was highlighted therein that there are three main methods associated with model validation, as follows:

  1. Collection of new data to validate the current model and its predictive ability,
  2. Comparison of current results with theoretical values and with earlier empirical and simulation results, and
  3. Use of a cross-validation (holdout) sample to validate and assess the predictive power of the current model.

The cross-validation approach is used to assess the validity and predictability of the regression models constructed: a certain number of records, say twenty, are withheld from the model-building process, and the constructed model is then used to estimate their values. A general rule of thumb in regression model building is to use 80 percent of the data set for developing the training model and the remaining 20 percent for validating it, as noted by Kutner et al. (2004). Validation records can be selected at random from the entire data set or, in the case of time series data, the validation set can be the most recent 20 percent (Kutner et al., 2004). Adequate regression models are expected to yield estimates reasonably close to the actual data values. Many statistics are available to aid in assessing the predictive power of regression models. A popular statistic for assessing this predictability is the Root Mean Squared Error of Prediction (RMSEP) (André et al., 2006). This statistic is computed as the square root of the sum of squared errors (SSE) for the withheld records divided by the corresponding degrees of freedom. Lower RMSEP values indicate better model predictability. Another common model validation statistic is the classical coefficient of determination, $R^2$, computed on the withheld sample, which provides some insight into the predictability of the model. Higher values are preferred, since the statistic indicates the amount of variation explained by the regressors in the regression model.
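
A minimal sketch of this holdout workflow follows, using the scikit-learn diabetes data as a stand-in and taking the degrees of freedom for RMSEP to be simply the number of withheld records (subtracting the number of estimated parameters instead is an equally common convention).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# 80 percent of the records build the model; 20 percent are withheld.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

pred = model.predict(X_val)
sse = np.sum((y_val - pred) ** 2)
rmsep = np.sqrt(sse / len(y_val))                    # lower is better
r2 = 1 - sse / np.sum((y_val - y_val.mean()) ** 2)   # validation R^2
print(f"RMSEP = {rmsep:.2f}, validation R^2 = {r2:.3f}")
```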

Breiman and Friedman (1997) observed that, when considering multiple regression models, it is of great importance for the predictors to share strength among the different models. Turlach et al. (2005) also observed that it is of particular interest, when there is a large number of covariates, to find a common set of variables that can be used for all models under investigation. In the context of mean regression, Turlach et al. (2005) considered the problem of selecting a subset of 770 wavelengths suitable as predictors for 14 different but correlated infrared spectrometry measurements, and they proposed a novel regularization method to perform simultaneous variable selection. Because classical regression approaches require the number of samples to exceed the number of variables, they are not applicable to genome-wide association (GWA) data. Additionally, least squares estimates of regression coefficients may be highly unstable, especially when the predictor variables are correlated, which leads to low prediction accuracy.

Genomic settings, where collinear predictors typically outnumber the available samples (p > n), are a case in point; one example is the prediction of cancer patient survival from tumor gene expression data (Beer et al., 2002; Shedden et al., 2008; Sørlie et al., 2001; van de Vijver et al., 2002; Wigle et al., 2002). In this setting, ordinary regression is subject to overfitting and instability of coefficients (Harrell et al., 1996), and stepwise variable selection methods do not scale well, as observed by Yuan and Lin (2006). Regression has been successfully adapted to high-dimensional situations by penalization methods (see, for instance, Hesterberg, 2008), and penalized regression has been shown to outperform univariate and other multivariate regression methods in multiple genomic datasets (Bøvelstad et al., 2007).
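
The sketch below illustrates, on assumed synthetic data, how penalized regression remains usable when p > n: both the LASSO and the Elastic Net fit a model with 200 predictors from only 50 samples and shrink most coefficients exactly to zero. CAEN has no scikit-learn implementation, so only the first two methods are shown.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

# Synthetic p > n data: 50 samples, 200 predictors, only 5 truly relevant.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)

# Cross-validation chooses the penalty strength; l1_ratio=0.5 mixes the
# L1 and L2 penalties of the Elastic Net equally.
lasso = LassoCV(cv=5).fit(X, y)
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print("LASSO nonzero coefficients:      ", int(np.sum(lasso.coef_ != 0)))
print("Elastic Net nonzero coefficients:", int(np.sum(enet.coef_ != 0)))
```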

Many simulation studies have been carried out suggesting that least squares estimates can be quite poor; see, for instance, Roecker (1991), Adams (1990), and Hurvich and Tsai (1990). These studies show that the estimated prediction errors from OLS are often too small and that the usual 95% confidence intervals include the true value of the parameter in only roughly 50% of cases. When predictor variables are strongly correlated, the prediction errors were shown to become too large.

 

CHAPTER THREE

METHODOLOGY

 Regression Analysis

Regression analysis is one of the most important tools for analyzing relationships between a response variable and one or more explanatory variables. It is widely used in diverse areas of human endeavor, including the social and biological sciences, economics, and many others, and has become one of the most important tools in data analysis.

Regression analysis has significant applications in medical and countless other research areas and is an important component of modern data analysis. The central objective is to understand the relationship between a response (or dependent) variable and a set of predictor variables (also known as explanatory variables, regressors, covariates, or independent variables), and to apply that relationship for the purpose of estimating and/or predicting future responses. There are many important theoretical, practical, and computational issues related to regression modeling and inference, including specification of the link function that relates the response variable and the predictor variables, estimation of the regression parameters in the link function, measures of model performance, diagnostic statistics to assess the modeling assumptions and goodness-of-fit, and remedial methods in cases where the assumptions are violated.
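
For the linear case considered in this thesis, the estimation problems being compared can be written out explicitly. The following display uses standard notation as a summary of the usual definitions, with $\lambda, \lambda_1, \lambda_2 \ge 0$ denoting tuning parameters, not values specific to this study:

```latex
\[
y = X\beta + \varepsilon, \qquad
\hat{\beta}_{\mathrm{OLS}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2
= (X^{\top}X)^{-1}X^{\top}y ,
\]
\[
\hat{\beta}_{\mathrm{LASSO}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2
+ \lambda \lVert \beta \rVert_1 , \qquad
\hat{\beta}_{\mathrm{EN}} = \arg\min_{\beta}\ \lVert y - X\beta \rVert_2^2
+ \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 .
\]
```

The closed form for OLS requires $X$ to have full column rank, which is exactly what fails when predictors are collinear or p > n; the penalized criteria remain well defined in those cases.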

The response variable can be continuous or categorical. Although some of the underlying ideas are similar for different types of response variables, the methodologies differ, in particular in the choice of the link function and the assessment of the model's goodness-of-fit. Effective model building is a significant issue: essentially, we search for the best-fitting and most parsimonious model that is practically meaningful and reasonably describes the relationship between the response and the set of predictor variables. The fit of the model to the data set is determined by measures of goodness-of-fit, and parsimony requires effective methods of model selection.
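
Two standard criteria that operationalize this trade-off between fit and parsimony are AIC and BIC, quoted here in their common Gaussian-error form as general background, not as the specific criteria adopted in this thesis. Both reward a small error sum of squares while charging a penalty that grows with the number of estimated parameters $k$:

```latex
\[
\mathrm{AIC} = n \ln\!\left(\frac{\mathrm{SSE}}{n}\right) + 2k , \qquad
\mathrm{BIC} = n \ln\!\left(\frac{\mathrm{SSE}}{n}\right) + k \ln n .
\]
```

Because BIC's penalty grows with $n$, it tends to select sparser models than AIC on large samples.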

CHAPTER FOUR

RESULTS AND DISCUSSION

 Introduction

This chapter compares the performance of the LASSO, Elastic Net, and CAEN penalized methods using numerical results.

 

CHAPTER FIVE

SUMMARY, CONCLUSION AND RECOMMENDATION

  Introduction

In this chapter, we present the summary, conclusion, and recommendations based on the results obtained from this research work.

 Summary

This research work aimed to compare the performance of the LASSO, Elastic Net, and CAEN regression methods using numerical results, and we applied all three methods to the diabetes dataset. From Table 4.10 in Chapter Four, the characteristics of each method can be observed. The LASSO regression performs both shrinkage and variable selection, leaving 7 nonzero variables in the final model. The Elastic Net regression also performs both shrinkage and variable selection, leaving 8 nonzero variables in the final model. CAEN selects 7 nonzero variables in the final model. The numerical results show the usual trade-off between prediction error and model complexity across the three methods; on this dataset, CAEN outperforms the LASSO and Elastic Net regressions in terms of mean squared error, and it produced a less complex model than the other two methods.

Conclusion

To establish an accurate model, one often needs to collect numerous variables. Unfortunately, those variables are often highly correlated. As we have discussed in this thesis, correlated variables make the model less predictive and more difficult to interpret. A penalized regression method therefore provides a better way of selecting the appropriate variables to establish an effective model, as observed in this thesis.

Recommendations and Suggestions for Further Study

Penalized regression techniques for linear regression have been developed over the last few decades to reduce the flaws of ordinary least squares regression with regard to prediction accuracy. Multicollinearity is a problem seldom considered in elementary statistics texts, because it is not really a mathematical-statistical problem but rather a problem in the interpretation of the coefficients. While not extensively treated there, it is a problem that confronts researchers in actual data-analytic situations, so researchers should always be alert to the possibility of it arising. And since multicollinearity diagnostics are so easily obtained, no one should ever report results of regressions that suffer from multicollinearity problems. Based on the diabetes dataset, CAEN performs better than the other two methods. CAEN could also be applied to survival data, since many survival data analysis problems involve large numbers of variables.

Contribution to Knowledge

  1. This research work compared the newly introduced Correlation Adjusted Elastic Net (CAEN) with the existing LASSO and Elastic Net regressions, and CAEN outperformed the other two methods in terms of mean squared error (MSE) on the diabetes dataset used.
  2. This research has further clarified the differences among the penalized regression methods considered, thereby helping researchers decide which technique to use when confronted with the problem of multicollinearity.

REFERENCES

  • Adams, J. (1990). A computer experiment to evaluate regression strategies. Proceedings of the Statistical Computing Section, American Statistical Association, 3(4):55-62.
  • André, N., Young, T.M., and Rials, T.G. (2006). Online monitoring of the buffer capacity of particleboard furnish by near-infrared spectroscopy. Applied Spectroscopy, 60(13):1204-1209.
  • Ayers, K.L. and Cordell, H.J. (2010). SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genetic Epidemiology, 34(6):879-891.
  • Beer, D.G., Kardia, S.L., Huang, C.C., Giordano, T.J., and Levin, A.M. (2002). Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine, 8(8):816-824.
  • Bondell, H.D. and Reich, B.J. (2008). Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR. Biometrics, 64(1):115-123.
  • Bøvelstad, H.M., Nygård, S., Størvold, H.L., Aldrin, M., Borgan, Ø., and Frigessi, A. (2007). Predicting survival from microarray data: a comparative study. Bioinformatics, 23(16):2080-2087.
  • Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6):2350-2383.
  • Breiman, L. and Friedman, J. (1997). Predicting multivariate responses in multiple linear regression (with discussion). Journal of the Royal Statistical Society, Series B, 59(1):3-54.
  • Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data. Springer-Verlag, New York, pp. 565-596.
  • Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351.