CORRELATION AND REGRESSION
Regression describes the linear relationship between two variables, and correlation measures the strength of that linear relationship; many practical questions can be answered using regression and correlation. For example, one could measure the heights and weights of a random sample of 15 students of the same sex and ask whether there is any apparent relationship. A scatter plot is a graphical representation of the relation between two or more variables.
Linear regression refers to a group of techniques for fitting and studying the straight-line relationship between two variables, while the correlation is a parameter of the bivariate Normal distribution. Even the simplest forms of regression and correlation can seem like incomprehensible formulas to beginning students, and their application is often a further source of difficulty.
When r is 0, there is no linear trend between the two variables. When r is positive, there is a trend for one variable to go up as the other goes up. When r is negative, there is a trend for one variable to go up as the other goes down. Linear regression finds the best line that predicts Y from X; correlation does not fit a line. What kind of data? Correlation is almost always used when you measure both variables. It is rarely appropriate when one variable is something you experimentally manipulate.
Linear regression is usually used when X is a variable you manipulate (time, concentration, etc.). Does it matter which variable is X and which is Y? For correlation it does not, but for regression it does. We want to estimate the underlying linear relationship so that we can predict ln urea, and hence urea, for a given age. Regression can be used to find the equation of this line.
This line is usually referred to as the regression line. Note that in a scatter diagram the response variable is always plotted on the vertical (y) axis. The gradient of this line is 0. The predicted ln urea of a patient aged 60 years, for example, is 0.
This transforms to a urea level of e 1. The y intercept is 0.
Statistics review 7: Correlation and regression
The regression line is obtained using the method of least squares. For a particular value of x, the vertical difference between the observed and fitted value of y is known as the deviation, or residual (Fig.). The method of least squares finds the values of a and b that minimise the sum of the squares of all the deviations. This gives the following formulae for calculating a and b:

b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  a = ȳ − b x̄
Usually, these values would be calculated using a statistical package or the statistical functions on a calculator. We can test the null hypotheses that the population intercept and gradient are each equal to 0 using test statistics given by the estimate of the coefficient divided by its standard error.
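As a concrete sketch of these least squares formulae, the following computes a and b directly. The x and y values are made-up illustrative numbers, not the article's urea data:

```python
# Least squares estimates of the gradient b and intercept a for y = a + b*x.
# Made-up illustrative data (not the article's accident and emergency data).
x = [40, 50, 60, 70, 80]        # e.g. age in years (hypothetical)
y = [1.0, 1.2, 1.3, 1.5, 1.6]   # e.g. ln(urea) (hypothetical)
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx            # gradient
a = y_bar - b * x_bar    # intercept: the fitted line is y = a + b*x
```

A statistical package would report the same two numbers; the point here is only that the formulae above are directly computable from the raw pairs.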
The test statistics are compared with the t distribution on n − 2 (sample size minus the number of regression coefficients) degrees of freedom [4]. The P value for the coefficient of ln urea 0.
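A sketch of this test statistic for the gradient, again with made-up data; the standard error formula SE(b) = s/√Σ(xᵢ − x̄)², with s the residual standard deviation on n − 2 degrees of freedom, is the standard one:

```python
import math

# Made-up illustrative data (not the article's)
x = [40, 50, 60, 70, 80]
y = [1.0, 1.2, 1.3, 1.5, 1.6]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Residual standard deviation s, based on n - 2 degrees of freedom
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(rss / (n - 2))

se_b = s / math.sqrt(sxx)   # standard error of the gradient
t_stat = b / se_b           # compare with the t distribution on n - 2 df
```

The P value would then come from the t distribution on n − 2 degrees of freedom, as the text describes.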
The coefficient of ln urea is the gradient of the regression line and its hypothesis test is equivalent to the test of the population correlation coefficient discussed above.
The P value for the constant of 0. Although the intercept is not significant, it is still appropriate to keep it in the equation. There are some situations in which a straight line passing through the origin is known to be appropriate for the data, and in this case a special regression analysis can be carried out that omits the constant [ 6 ].
Regression parameter estimates, P values and confidence intervals for the accident and emergency unit data. As stated above, the method of least squares minimizes the sum of squares of the deviations of the points about the regression line. Consider the small data set illustrated in Fig. This figure shows that, for a particular value of x, the distance of y from the mean of y (the total deviation) is the sum of the distance of the fitted y value from the mean (the deviation explained by the regression) and the distance from y to the line (the deviation not explained by the regression).
The sum of squared deviations can be compared with the total variation in y, which is measured by the sum of squares of the deviations of y from the mean of y. The explained sum of squares is referred to as the 'regression sum of squares' and the unexplained sum of squares is referred to as the 'residual sum of squares'.
Small data set with the fitted values from the regression, the deviations and their sums of squares. The mean squares are the sums of squares divided by their degrees of freedom.
If there were no linear relationship between the variables then the regression mean squares would be approximately the same as the residual mean squares. We can test the null hypothesis that there is no linear relationship using an F test. The test statistic is calculated as the regression mean square divided by the residual mean square, and a P value may be obtained by comparison of the test statistic with the F distribution with 1 and n - 2 degrees of freedom [ 2 ].
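The analysis of variance just described can be sketched as follows (same made-up data as above; with one predictor the F statistic is the square of the gradient's t statistic):

```python
# Made-up illustrative data (not the article's)
x = [40, 50, 60, 70, 80]
y = [1.0, 1.2, 1.3, 1.5, 1.6]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

reg_ss = sum((f - y_bar) ** 2 for f in fitted)           # regression sum of squares
res_ss = sum((yi - f) ** 2 for yi, f in zip(y, fitted))  # residual sum of squares

reg_ms = reg_ss / 1          # regression mean square: 1 df (one predictor)
res_ms = res_ss / (n - 2)    # residual mean square: n - 2 df
f_stat = reg_ms / res_ms     # compare with F distribution on 1 and n - 2 df
```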
Usually, this analysis is carried out using a statistical package that will produce an exact P value. In fact, the F test from the analysis of variance is equivalent to the t test of the gradient for regression with only one predictor.
This is not the case with more than one predictor, but this will be the subject of a future review. As discussed above, the test for gradient is also equivalent to that for the correlation, giving three tests with identical P values. Therefore, when there is only one predictor variable it does not matter which of these tests is used. Another useful quantity that can be obtained from the analysis of variance is the coefficient of determination R².
It is the proportion of the total variation in y accounted for by the regression model. Values of R² close to 1 imply that most of the variability in y is explained by the regression model. R² is the same as r² in regression when there is only one predictor variable. Any remaining variability may be due to inherent variability in ln urea or to other unknown factors that affect the level of ln urea.
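Computing R² from the sums of squares, and checking the R² = r² identity for a single predictor, can be sketched as (same made-up data):

```python
import math

# Made-up illustrative data (not the article's)
x = [40, 50, 60, 70, 80]
y = [1.0, 1.2, 1.3, 1.5, 1.6]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

reg_ss = sum((f - y_bar) ** 2 for f in fitted)   # explained sum of squares
total_ss = sum((yi - y_bar) ** 2 for yi in y)    # total sum of squares

r_squared = reg_ss / total_ss          # proportion of variation explained
r = sxy / math.sqrt(sxx * total_ss)    # sample correlation coefficient
# With a single predictor, r_squared equals r ** 2
```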
The fitted value of y for a given value of x is an estimate of the population mean of y for that particular value of x. As such it can be used to provide a confidence interval for the population mean [3]. The fitted values change as x changes, and therefore the confidence intervals will also change. The standard error of the fitted value is given by:

SE(fitted y) = s √(1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)²)

where s is the residual standard deviation and x₀ is the value of the predictor. This transforms to urea values of 4. The fitted value for y also provides a predicted value for an individual, and a prediction interval or reference range [3] can be obtained (Fig.).
The prediction interval is calculated in the same way as the confidence interval, but the standard error is given by:

SE(predicted y) = s √(1 + 1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)²)
This transforms to urea values of 2. Both confidence intervals and prediction intervals become wider for values of the predictor variable further from the mean. The use of correlation and regression depends on some underlying assumptions. The observations are assumed to be independent. For correlation both variables should be random variables, but for regression only the response variable y must be random.
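The two standard errors for intervals about the regression line, and their widening away from the mean, can be sketched as follows (made-up data; the formulas are the standard ones, with s the residual standard deviation):

```python
import math

# Made-up illustrative data (not the article's)
x = [40, 50, 60, 70, 80]
y = [1.0, 1.2, 1.3, 1.5, 1.6]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(rss / (n - 2))   # residual standard deviation

def se_mean(x0):
    """SE of the fitted mean of y at x0 (used for a confidence interval)."""
    return s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)

def se_individual(x0):
    """SE of a single new observation at x0 (used for a prediction interval)."""
    return s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
```

The prediction-interval standard error carries the extra "1 +" term for the variability of an individual observation, and both standard errors grow as x₀ moves away from x̄, which is why both kinds of interval fan out at the ends of the data.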
In carrying out hypothesis tests or calculating confidence intervals for the regression parameters, the response variable should have a Normal distribution and the variability of y should be the same for each value of the predictor variable.
The same assumptions are needed in testing the null hypothesis that the correlation is 0, but in order to interpret confidence intervals for the correlation coefficient both variables must be Normally distributed.
Both correlation and regression assume that the relationship between the two variables is linear. A scatter diagram of the data provides an initial check of the assumptions for regression.
The assumptions can be assessed in more detail by looking at plots of the residuals [ 4 , 7 ]. Commonly, the residuals are plotted against the fitted values.
If the relationship is linear and the variability constant, then the residuals should be evenly scattered around 0 along the range of fitted values Fig. In addition, a Normal plot of residuals can be produced.
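Computing the Normal scores for such a plot can be sketched as follows; the residuals are hypothetical, and Blom's approximation to the expected Normal order statistics is one common choice (the article does not specify which approximation its package uses):

```python
from statistics import NormalDist

# Hypothetical residuals from a fitted regression (made-up values)
residuals = [-0.02, 0.03, -0.02, 0.03, -0.02]
n = len(residuals)

nd = NormalDist()  # standard Normal distribution
# Expected value of the i-th smallest of n standard Normal draws,
# via Blom's approximation
scores = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Plot sorted residuals against these scores; an approximately straight
# line supports the Normality assumption.
pairs = list(zip(sorted(residuals), scores))
```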
This is a plot of the residuals against the values they would be expected to take if they came from a standard Normal distribution (Normal scores).

The data are displayed in a scatter diagram in the figure below. Each point represents an (x, y) pair, in this case the gestational age, measured in weeks, and the birth weight, measured in grams. Note that the independent variable is on the horizontal axis (or X-axis), and the dependent variable is on the vertical axis (or Y-axis).
The scatter plot shows a positive or direct association between gestational age and birth weight.
Infants with shorter gestational ages are more likely to be born with lower weights, and infants with longer gestational ages are more likely to be born with higher weights. The formula for the sample correlation coefficient is

r = Cov(x, y) / (sₓ s_y)

where Cov(x, y) is the covariance of x and y, defined as

Cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

and sₓ², s_y² are the sample variances of x and y, defined as

sₓ² = Σ(xᵢ − x̄)² / (n − 1),  s_y² = Σ(yᵢ − ȳ)² / (n − 1)

The variances of x and y measure the variability of the x scores and y scores around their respective sample means, considered separately.
The covariance measures the variability of the (x, y) pairs around the mean of x and the mean of y, considered simultaneously. To compute the sample correlation coefficient, we need to compute the variance of gestational age, the variance of birth weight, and the covariance of gestational age and birth weight. We first summarize the gestational age data and compute the mean gestational age. To compute the variance of gestational age, we sum the squared deviations (or differences) between each observed gestational age and the mean gestational age. The computations are summarized below. Next, we summarize the birth weight data, computing the mean and variance of birth weight just as we did for gestational age, as shown in the table below. Next we compute the covariance. To compute the covariance of gestational age and birth weight, we multiply the deviation from the mean gestational age by the deviation from the mean birth weight for each participant.
Notice that we simply copy the deviations from the mean gestational age and birth weight from the two tables above into the table below and multiply them.
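The whole computation can be sketched end to end as follows. The gestational ages and birth weights are made-up values chosen to lie exactly on a line, so r comes out as 1; the article's actual data are not reproduced here:

```python
import math

# Made-up data: gestational age (weeks) and birth weight (grams),
# deliberately perfectly linear so that r = 1
x = [34, 36, 38, 40, 42]
y = [2100, 2500, 2900, 3300, 3700]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample variances and covariance (divisor n - 1)
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

r = cov_xy / math.sqrt(var_x * var_y)  # sample correlation coefficient
```

Real data would of course scatter about the line, giving |r| < 1.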
The correlation coefficient takes values between −1 and +1. In algebraic notation, if we have two variables x and y, the data take the form of n pairs (xᵢ, yᵢ) for i = 1, 2, …, n. On a scatter diagram, the closer the points lie to a straight line, the stronger the linear relationship between the two variables.
A single outlier may produce the same sort of effect. When using a regression equation for prediction, errors in prediction may not be just random but may also be due to inadequacies in the model.