Regression coefficient. What the regression equation coefficient shows: correlation and regression analysis

REGRESSION COEFFICIENT

- English: regression coefficient; German: Regressionskoeffizient. One of the characteristics of the relationship between the dependent variable y and the independent variable x. The regression coefficient shows by how many units the value of y increases when the variable x changes by one unit. Geometrically, the regression coefficient is the slope of the straight line y(x).

Antinazi. Encyclopedia of Sociology, 2009

See what "REGRESSION COEFFICIENT" is in other dictionaries:

    regression coefficient- - [L.G. Sumenko. English Russian Dictionary of Information Technologies. M.: GP TsNIIS, 2003.] Topics Information Technology in general EN regression coefficient … Technical Translator's Handbook

    Regression coefficient- 35. Regression coefficient Parameter of the regression analysis model Source: GOST 24026 80: Research tests. Experiment planning. Terms and Definitions …

    regression coefficient- The coefficient of the independent variable in the regression equation ... Dictionary of Sociological Statistics

    REGRESSION COEFFICIENT- English. coefficient, regression; German Regressionskoeffizient. One of the characteristics of the relationship between dependent y and independent variable x. K. r. shows by how many units the value accepted by y increases if the variable x changes to ... ... Dictionary in sociology

    sample regression coefficient- 2.44. sample regression coefficient Coefficient of a variable in a regression curve or surface equation Source: GOST R 50779.10 2000: Statistical methods. Probability and bases of statistics. Terms and Definitions … Dictionary-reference book of terms of normative and technical documentation

    Partial regression coefficient- a statistical measure that indicates the degree of influence of the independent variable on the dependent in a situation where the mutual influence of all other variables in the model is under the control of the researcher ... Sociological Dictionary Socium

    REGRESSIONS, WEIGHT- A synonym for the concept of regression coefficient ... Explanatory Dictionary of Psychology

    HERITABILITY COEFFICIENT- An indicator of the relative share of genetic variability in the overall phenotypic variation of a trait. The most common methods for assessing the heritability of economically useful traits are: where h2 is the heritability coefficient; r intraclass… … Terms and definitions used in breeding, genetics and reproduction of farm animals

    - (R squared) is the proportion of the variance of the dependent variable that is explained by the dependence model in question, that is, the explanatory variables. More precisely, this is one minus the proportion of unexplained variance (the variance of the random error of the model, or conditional ... ... Wikipedia

    The coefficient of the independent variable in the regression equation. So, for example, in the equation linear regression, linking random variables Y and X, R. c. b0 and b1 are equal: where r is the correlation coefficient of X and Y, . Calculation of estimates R. k. Mathematical Encyclopedia


What is regression?

Consider two continuous variables x = (x1, x2, …, xn) and y = (y1, y2, …, yn).

Let's place the points on a two-dimensional scatter plot and say that we have a linear relationship if the data are approximated by a straight line.

If we assume that y depends on x, and that changes in y are caused by changes in x, we can determine the regression line (the regression of y on x), which best describes the straight-line relationship between these two variables.

The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that while tall fathers tend to have tall sons, the average height of sons is smaller than that of their tall fathers. The average height of sons "regressed" and "moved back" to the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

regression line

The mathematical equation that estimates a simple (pairwise) linear regression line is Y = a + bx, where:

x is the independent variable, or predictor.

Y is the dependent, or response, variable; it is the value we expect for y (on average) if we know the value of x, i.e. it is the predicted value of y.

  • a is the intercept of the estimated line; it is the value of Y when x = 0 (Fig. 1).
  • b is the slope, or gradient, of the estimated line; it is the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases when x increases by one unit)

Least square method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method of determining the coefficients a and b is the least squares method (OLS).

The fit is evaluated by considering the residuals (the vertical distance of each point from the line, i.e. residual = observed y - predicted y, Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dotted lines) for each point.
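To make the procedure concrete, here is a minimal sketch in Python (the data values are invented for illustration and are not part of the original example):

```python
import numpy as np

# invented sample of (x, y) pairs, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# least squares estimates:
# b = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),  a = y_mean - b * x_mean
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x               # predicted values
residuals = y - y_hat           # vertical distances of the points from the line
print(a, b, np.sum(residuals ** 2))   # intercept, slope, sum of squared residuals
```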

Linear Regression Assumptions

So, for each observed value, the residual is equal to the difference between it and the corresponding predicted value. Each residual can be positive or negative.

You can use the residuals to test the following assumptions behind linear regression:

  • the relationship between x and y is linear;
  • the residuals are normally distributed with zero mean;
  • the variance of the residuals is constant (does not depend on x).

If the assumptions of linearity, normality, and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (e.g., use a logarithmic transformation, etc.).
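A sketch of how these assumptions can be checked from the residuals, again on invented data (the Shapiro-Wilk test is used here as one possible normality check):

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented data, as in the sketch above
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b, a = np.polyfit(x, y, deg=1)             # slope and intercept of the fitted line

residuals = y - (a + b * x)
print("mean residual:", residuals.mean())  # should be ~0 for a least squares fit

stat, p = stats.shapiro(residuals)         # Shapiro-Wilk test of normality
print("Shapiro-Wilk p-value:", p)          # a small p-value casts doubt on normality

# constant variance is usually checked visually, e.g. by plotting residuals against x
```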

Abnormal values ​​(outliers) and points of influence

An "influential" observation, if omitted, changes one or more model parameter estimates (ie slope or intercept).

An outlier (an observation that contradicts most of the values ​​in the dataset) can be an "influential" observation and can be well detected visually when looking at a 2D scatterplot or a plot of residuals.

Both for outliers and for "influential" observations (points), models are used, both with their inclusion and without them, pay attention to the change in the estimate (regression coefficients).

When doing an analysis, do not automatically discard outliers or influence points, as simply ignoring them can affect the results. Always study the causes of these outliers and analyze them.

Linear regression hypothesis

When constructing a linear regression, we test the null hypothesis that the population slope of the regression line β is equal to zero.

If the slope of the line is zero, there is no linear relationship between x and y: a change in x does not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic equal to the ratio b / SE(b), which follows a t distribution with n - 2 degrees of freedom, where SE(b) is the standard error of the coefficient b:

SE(b) = s / √Σ(x - x̄)²,

where s² = Σ(y - ŷ)² / (n - 2) is the estimate of the variance of the residuals.

Usually, if the achieved significance level is p < 0.05, the null hypothesis is rejected.

A 95% confidence interval for the population slope is

b ± t(n-2) × SE(b),

where t(n-2) is the percentage point of the t distribution with n - 2 degrees of freedom which gives a two-sided probability of 0.05. This is the interval that contains the population slope with a probability of 95%.

For large samples we can approximate t(n-2) with the value 1.96 (that is, the test statistic will tend to be normally distributed).
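A sketch of this algorithm in Python, using the same invented data as above; the standard error of b is taken from the residual variance and the p-value from the t distribution with n - 2 degrees of freedom:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # invented data, as above
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b, a = np.polyfit(x, y, deg=1)
s2 = np.sum((y - (a + b * x)) ** 2) / (n - 2)        # estimate of the residual variance
se_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))     # standard error of the slope b

t_stat = b / se_b                                    # H0: population slope = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided p-value
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_95 = (b - t_crit * se_b, b + t_crit * se_b)       # 95% confidence interval for the slope
print(t_stat, p_value, ci_95)
```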

Evaluation of the quality of linear regression: coefficient of determination R 2

Because of the linear relationship between x and y, we expect that y changes as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If so, then most of the variation will be explained by the regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The proportion of the total variance that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in paired linear regression it equals r², the square of the correlation coefficient); it allows a subjective assessment of the quality of the regression equation.

The difference (100% - R²) is the percentage of variance that cannot be explained by the regression.

With no formal test to evaluate R², we are forced to rely on subjective judgment to determine the quality of the fit of the regression line.
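A short sketch (invented data as above) showing that the coefficient of determination computed as the explained share of variation coincides with r² in paired linear regression:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # invented data, as above
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b, a = np.polyfit(x, y, deg=1)

ss_total = np.sum((y - y.mean()) ** 2)      # total variation of y
ss_resid = np.sum((y - (a + b * x)) ** 2)   # variation not explained by the regression
r_squared = 1 - ss_resid / ss_total         # share of variation explained by the regression

r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation coefficient
print(r_squared, r ** 2)                    # the two values coincide
```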

Applying a Regression Line to a Forecast

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a certain value of x by substituting that value of x into the regression line equation.

So, if we predict y as ŷ = a + bx, we use this predicted value and its standard error to estimate a confidence interval for the true population mean.

Repeating this procedure for different values of x allows you to build confidence limits for the line as a whole. This is a band, or area, that contains the true line, for example, with a 95% confidence level.

Simple regression plans

Simple regression designs contain one continuous predictor. If there are 3 cases with values of the predictor P, such as 7, 4 and 9, and the design includes a first-order effect of P, then the design matrix X will be

1  7
1  4
1  9

and the regression equation using P for X1 looks like

Y = b0 + b1·P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix are raised to the second power:

1  49
1  16
1  81

and the equation takes the form

Y = b0 + b1·P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs and other designs containing only continuous predictors (since there are simply no categorical predictors). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values of the X variables; no recoding is performed. In addition, when describing regression designs, you can omit the design matrix X and work only with the regression equation.
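A sketch of the two design matrices described above for the predictor values 7, 4 and 9 (numpy is used only for illustration; the column of ones corresponds to b0, and Y below stands for a hypothetical response vector):

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])                          # the three predictor values

# first-order design: a column of ones (for b0) and the predictor P itself
X_first_order = np.column_stack([np.ones_like(P), P])

# quadratic effect: the predictor column is raised to the second power
X_quadratic = np.column_stack([np.ones_like(P), P ** 2])

print(X_first_order)
print(X_quadratic)
# the coefficients would then be estimated by least squares, e.g.
# coef, *_ = np.linalg.lstsq(X_first_order, Y, rcond=None)   # Y is the response vector
```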

Example: Simple Regression Analysis

This example uses the data provided in the table:

Fig. 3. Table of initial data.

The data is based on a comparison of the 1960 and 1970 censuses in 30 randomly selected counties. County names are represented as observation names. Information regarding each variable is presented below:

Fig. 4. Variable specification table.

Research objective

For this example, we will analyze how well population change predicts the poverty rate, that is, the percentage of families below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

One can put forward a hypothesis: the change in the population and the percentage of families that are below the poverty line are related. It seems reasonable to expect that poverty leads to an outflow of population, hence there would be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng ) as a predictor variable.

View Results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the non-standardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every unit decrease in population change there is an increase in the poverty rate of .40374. The upper and lower (default) 95% confidence limits for this non-standardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note the standardized coefficient, which for simple regression designs is also the Pearson correlation coefficient; it equals -.65, which means that for every standard deviation decrease in population change there is a .65 standard deviation increase in the poverty rate.

Distribution of variables

Correlation coefficients can become significantly overestimated or underestimated if there are large outliers in the data. Let us examine the distribution of the dependent variable Pt_Poor by county. To do this, we will build a histogram of the Pt_Poor variable.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two right-most bars) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within the range."

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. The rule of thumb is that an observation should be treated as an outlier if it does not fall within the interval (mean ± 3 standard deviations). In that case it is worth repeating the analysis with and without the outliers to make sure that they do not seriously affect the correlation between the variables.

Scatterplot

If one of the hypotheses is a priori about the relationship between the given variables, then it is useful to check it on the plot of the corresponding scatterplot.

Fig. 8. Scatterplot.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., with 95% probability the regression line passes between the two dashed curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor , p<.001 .

Outcome

This example showed how to analyze a simple regression plan. An interpretation of non-standardized and standardized regression coefficients was also presented. The importance of studying the response distribution of the dependent variable is discussed, and a technique for determining the direction and strength of the relationship between the predictor and the dependent variable is demonstrated.

Regression analysis is a statistical research method that allows you to show the dependence of a parameter on one or more independent variables. In the pre-computer era, its use was quite difficult, especially when it came to large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are specific examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics in 1886. Regression happens:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • demonstrative;
  • logarithmic.

Example 1

Consider the problem of determining how the number of employees who quit depends on the average salary at 6 industrial enterprises.

A task. At six enterprises, we analyzed the average monthly salary and the number of employees who left of their own free will. In tabular form we have:

The number of people who left

Salary

30000 rubles

35000 rubles

40000 rubles

45000 rubles

50000 rubles

55000 rubles

60000 rubles

For the problem of determining the dependence of the number of employees who quit on the average salary at 6 enterprises, the regression model has the form Y = a0 + a1x1 + … + akxk, where xi are the influencing variables, ai are the regression coefficients, and k is the number of factors.

For this task, Y is the indicator of employees who left, and the influencing factor is the salary, which we denote by X.

Using the capabilities of the spreadsheet "Excel"

Regression analysis in Excel must be preceded by applying the built-in functions to the available tabular data. However, for these purposes it is better to use the very useful Analysis ToolPak add-in. To activate it you need to:

  • from the "File" tab, go to the "Options" section;
  • in the window that opens, select the line "Add-ons";
  • click on the "Go" button located at the bottom, to the right of the "Management" line;
  • check the box next to the name "Analysis Package" and confirm your actions by clicking "OK".

If everything is done correctly, the desired button will appear on the right side of the Data tab, located above the Excel worksheet.

Building a regression in Excel

Now that we have at hand all the necessary virtual tools for performing econometric calculations, we can begin to solve our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, enter the range of values ​​for Y (the number of employees who quit) and for X (their salaries);
  • We confirm our actions by pressing the "Ok" button.

As a result, the program will automatically populate a new sheet of the spreadsheet with regression analysis data. Note! Excel has the ability to manually set the location you prefer for this purpose. For example, it could be the same sheet where the Y and X values ​​are, or even a new workbook specifically designed to store such data.

Analysis of regression results for R-square

In Excel, the data obtained during the processing of the data of the considered example looks like this:

First of all, you should pay attention to the value of R-square. It is the coefficient of determination. In this example, R-square = 0.755 (75.5%), i.e. the calculated parameters of the model explain 75.5% of the relationship between the considered parameters. The higher the value of the coefficient of determination, the more applicable the chosen model is for a particular task. It is considered to describe the real situation correctly when the R-square value is above 0.8. If R-square < 0.5, such a regression analysis in Excel cannot be considered reasonable.

Coefficient analysis

The number 64.1428 shows what the value of Y will be if all the variables xi in the model we are considering are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors that are not described in a specific model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. It means that within the model under consideration the average monthly salary affects the number of employees who quit with a weight of -0.16285, i.e. the degree of its influence is quite small. The "-" sign indicates that the coefficient is negative. This is to be expected, since everyone knows that the higher the salary at an enterprise, the fewer people express a desire to terminate the employment contract or quit.

Multiple Regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x1, x2, … xm) + ε, where y is the effective attribute (dependent variable), and x1, x2, … xm are the factor attributes (independent variables).

Parameter Estimation

For multiple regression (MR), estimation is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b1x1 + … + bmxm + ε, we construct a system of normal equations (see below).

To understand the principle of the method, consider the two-factor case. Solving the normal equations yields the standardized coefficients βi, and from them we get:

bi = βi · σy / σxi,

where σ is the standard deviation of the corresponding attribute indicated in its index.

OLS is also applicable to the MR equation in standardized form. In this case we get the equation:

ty = β1·tx1 + … + βm·txm,

where ty, tx1, …, txm are standardized variables, for which the mean values are 0 and the standard deviation is 1, and βi are the standardized regression coefficients.

Note that all βi in this case are normalized and centered, so comparing them with one another is correct and admissible. In addition, it is customary to screen the factors by discarding those with the smallest values of βi.
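As an illustration of OLS estimation and standardized coefficients for multiple regression, here is a sketch with invented two-factor data (the variable names are hypothetical):

```python
import numpy as np

# invented two-factor data, for illustration only
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 6.8, 7.2, 10.1, 10.6])

# OLS for y = a + b1*x1 + b2*x2 (equivalent to solving the system of normal equations)
X = np.column_stack([np.ones_like(x1), x1, x2])
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# standardized (beta) coefficients: beta_i = b_i * sigma_xi / sigma_y
beta1 = b1 * x1.std(ddof=1) / y.std(ddof=1)
beta2 = b2 * x2.std(ddof=1) / y.std(ddof=1)
print(a, b1, b2, beta1, beta2)
```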

Problem using linear regression equation

Suppose there is a table of the price dynamics of a particular product N during the last 8 months. It is necessary to make a decision on the advisability of purchasing its batch at a price of 1850 rubles/t.

Month number | Price of item N
1 | 1750 rubles per ton
2 | 1755 rubles per ton
3 | 1767 rubles per ton
4 | 1760 rubles per ton
5 | 1770 rubles per ton
6 | 1790 rubles per ton
7 | 1810 rubles per ton
8 | 1840 rubles per ton

To solve this problem in the Excel spreadsheet, you need to use the Data Analysis tool already known from the above example. Next, select the "Regression" section and set the parameters. It must be remembered that in the "Input interval Y" field, a range of values ​​for the dependent variable (in this case, the price of a product in specific months of the year) must be entered, and in the "Input interval X" - for the independent variable (month number). Confirm the action by clicking "Ok". On a new sheet (if it was indicated so), we get data for regression.

Based on them, we build a linear equation of the form y = ax + b, where the parameters a and b are taken from the row labeled with the month-number variable and from the "Y-intercept" row of the sheet with the regression analysis results. Thus, the linear regression equation (LE) for this problem is written as:

Product price N = 11.714* month number + 1727.54.

or in algebraic notation

y = 11.714 x + 1727.54
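The same fit can be reproduced outside Excel; a sketch with numpy using the eight monthly prices from the table should give approximately the coefficients above and a forecast for month 9 below the offered 1850 rubles/t:

```python
import numpy as np

month = np.arange(1, 9)                                   # month numbers 1..8
price = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840], dtype=float)

slope, intercept = np.polyfit(month, price, deg=1)        # ~11.714 and ~1727.54
forecast_month_9 = slope * 9 + intercept                  # forecast for the next month
print(slope, intercept, forecast_month_9)                 # forecast ~1833 < 1850 rubles/t
```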

Analysis of results

To decide whether the resulting linear regression equation is adequate, multiple correlation coefficients (MCC) and determination coefficients are used, as well as Fisher's test and Student's test. In the Excel table with regression results, they appear under the names of multiple R, R-square, F-statistic and t-statistic, respectively.

The MCC R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "month number" and "price of item N in rubles per ton". However, the nature of this relationship remains unknown.

The coefficient of determination R² (RI) is a numerical characteristic of the share of the total scatter of the experimental data, i.e. of the values of the dependent variable, that is described by the linear regression equation. In the problem under consideration this value equals 84.8%, i.e. the statistical data are described by the obtained regression equation with a high degree of accuracy.

F-statistics, also called Fisher's test, is used to assess the significance of a linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's criterion) helps to evaluate the significance of the coefficient of the unknown and of the free (intercept) term of the linear relationship. If the value of the t-criterion > t cr, the hypothesis that the free term of the linear equation is insignificant is rejected.

In the problem under consideration, the Excel tools give t = 169.20903 and p = 2.89E-12 for the free term, i.e. the probability that a correct hypothesis about the insignificance of the free term would be rejected is essentially zero. For the coefficient of the unknown, t = 5.79405 and p = 0.001158. In other words, the probability that a correct hypothesis about the insignificance of the coefficient of the unknown would be rejected is 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the expediency of buying a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Consider a specific applied problem.

The management of NNN must decide on the advisability of purchasing a 20% stake in MMM SA. The cost of the stake (SP) is 70 million US dollars. NNN specialists have collected data on similar transactions. It was decided to evaluate the value of the block of shares according to parameters, expressed in millions of US dollars, such as:

  • accounts payable (VK);
  • annual turnover (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter of the enterprise's payroll arrears (VZP) in thousands of US dollars is used.

Solution using Excel spreadsheet

First of all, you need to create a table of initial data. It looks like this:

  • call the "Data Analysis" window;
  • select the "Regression" section;
  • in the box "Input interval Y" enter the range of values ​​of dependent variables from column G;
  • click on the icon with a red arrow to the right of the "Input interval X" window and select the range of all values ​​​​from columns B, C, D, F on the sheet.

Select "New Worksheet" and click "Ok".

Get the regression analysis for the given problem.

Examination of the results and conclusions

From the rounded data presented above on the Excel worksheet, we "assemble" the regression equation:

SP = 0.103*SOF + 0.541*VO - 0.031*VK + 0.405*VD + 0.691*VZP - 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 +0.405*x4 +0.691*x5 - 265.844

Data for JSC "MMM" are presented in the table:

Substituting them into the regression equation gives a figure of 64.72 million US dollars. This means that the shares of JSC MMM should not be purchased, since their asking price of 70 million US dollars is rather overstated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you solve practical problems from the field of econometrics.

In the previous notes, the focus has often been on a single numerical variable, such as mutual fund returns, Web page load time, or soft drink consumption. In this and the following notes, we will consider methods for predicting the values ​​of a numeric variable depending on the values ​​of one or more other numeric variables.

The material will be illustrated with a through example. Forecasting sales volume in a clothing store. The Sunflowers chain of discount clothing stores has been constantly expanding for 25 years. However, the company does not currently have a systematic approach to selecting new outlets. The location where the company intends to open a new store is determined based on subjective considerations. The selection criteria are favorable rental conditions or the manager's idea of ​​the ideal location of the store. Imagine that you are the head of the Special Projects and Planning Department. You have been tasked with developing a strategic plan for opening new stores. This plan should contain a forecast of annual sales in newly opened stores. You believe that selling space is directly related to revenue and want to factor that fact into your decision making process. How do you develop a statistical model that predicts annual sales based on new store size?

Typically, regression analysis is used to predict the values of a variable. Its goal is to develop a statistical model that predicts the values of the dependent variable, or response, from the values of at least one independent, or explanatory, variable. In this note, we will consider simple linear regression, a statistical method that allows you to predict the values of the dependent variable Y from the values of the independent variable X. The following notes will describe a multiple regression model designed to predict the values of the dependent variable Y from the values of several independent variables (X1, X2, …, Xk).


Types of regression models

The Durbin-Watson statistic is related to the autocorrelation of the residuals approximately as D ≈ 2(1 - ρ1), where ρ1 is the first-order autocorrelation coefficient; if ρ1 = 0 (no autocorrelation), D ≈ 2; if ρ1 ≈ 1 (positive autocorrelation), D ≈ 0; if ρ1 = -1 (negative autocorrelation), D ≈ 4.

In practice, the application of the Durbin-Watson criterion is based on comparing the value D with the critical theoretical values dL and dU for a given number of observations n, number of independent variables in the model k (for simple linear regression k = 1) and significance level α. If D < dL, the hypothesis of independence of the random deviations is rejected (hence there is positive autocorrelation); if D > dU, the hypothesis is not rejected (that is, there is no autocorrelation); if dL < D < dU, there is not enough evidence to make a decision. When the calculated value of D exceeds 2, it is not D itself but the expression (4 - D) that is compared with dL and dU.

To calculate the Durbin-Watson statistic in Excel, we turn to the bottom table in Fig. 14, Residual output. The numerator in expression (10) is calculated with the function =SUMXMY2(array1, array2), and the denominator with =SUMSQ(array) (Fig. 16).

Fig. 16. Formulas for calculating the Durbin-Watson statistic

In our example D= 0.883. The main question is: what value of the Durbin-Watson statistic should be considered small enough to conclude that there is a positive autocorrelation? It is necessary to correlate the value of D with the critical values ​​( dL and d U) depending on the number of observations n and significance level α (Fig. 17).

Fig. 17. Critical values of the Durbin-Watson statistic (table fragment)

Thus, in the problem of the sales volume of a store delivering goods to the home, there is one independent variable (k = 1), 15 observations (n = 15) and a significance level α = 0.05. Consequently, dL = 1.08 and dU = 1.36. Because D = 0.883 < dL = 1.08, there is positive autocorrelation between the residuals and the least squares method cannot be applied.
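A sketch of the same calculation in Python; the residuals here are invented placeholders, and statsmodels' durbin_watson gives the identical value:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# residuals e_1..e_n of a fitted regression (placeholder values for illustration)
e = np.array([0.5, 0.8, 0.6, 0.9, 0.4, 0.7, 0.5, 0.8, 0.3, 0.6, 0.9, 0.7, 0.5, 0.4, 0.6])

# D = sum((e_t - e_{t-1})^2) / sum(e_t^2), the analogue of SUMXMY2/SUMSQ in Excel
D = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(D, durbin_watson(e))   # both expressions give the same value
```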

Testing Hypotheses about Slope and Correlation Coefficient

The above regression was applied solely for forecasting. To determine regression coefficients and predict the value of a variable Y for a given variable value X the method of least squares was used. In addition, we considered the standard error of the estimate and the coefficient of mixed correlation. If the residual analysis confirms that the applicability conditions of the least squares method are not violated, and the simple linear regression model is adequate, based on the sample data, it can be argued that there is a linear relationship between the variables in the population.

Applying the t-criterion for the slope. By checking whether the population slope β1 is equal to zero, one can determine whether there is a statistically significant relationship between the variables X and Y. If this hypothesis is rejected, it can be argued that there is a linear relationship between X and Y. The null and alternative hypotheses are formulated as follows: H0: β1 = 0 (no linear relationship), H1: β1 ≠ 0 (there is a linear relationship). By definition, the t-statistic is equal to the difference between the sample slope and the hypothesized population slope, divided by the standard error of the slope estimate:

(11) t = (b1 - β1) / Sb1

where b1 is the slope of the regression line based on the sample data, β1 is the hypothesized slope of the population regression line, Sb1 is the standard error of the slope estimate, and the test statistic t has a t distribution with n - 2 degrees of freedom.

Let's check whether there is a statistically significant relationship between store size and annual sales at α = 0.05. The t-criterion is displayed along with other parameters when using the Analysis Package (Regression option). The full results of the Analysis Package are shown in Fig. 4; the fragment related to the t-statistic is in Fig. 18.

Fig. 18. Results of applying the t-criterion

Because the number of stores is n = 14 (see Fig. 3), the critical values of the t-statistic at a significance level α = 0.05 can be found by the formulas: tL = STUDENT.INV(0.025; 12) = -2.1788, where 0.025 is half the significance level and 12 = n - 2; tU = STUDENT.INV(0.975; 12) = +2.1788.

Because the t-statistic = 10.64 > tU = 2.1788 (Fig. 19), the null hypothesis H0 is rejected. On the other hand, the p-value for t = 10.6411, calculated by the formula =1-STUDENT.DIST(D3; 12; TRUE), is approximately equal to zero, so the hypothesis H0 is rejected again. The fact that the p-value is almost zero means that if there were no real linear relationship between store size and annual sales, it would be almost impossible to detect one using linear regression. Therefore, there is a statistically significant linear relationship between average annual store sales and store size.

Fig. 19. Testing the hypothesis about the slope of the general population at a significance level of 0.05 and 12 degrees of freedom
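The same critical values can be reproduced with scipy instead of the Excel function (a sketch):

```python
from scipy import stats

alpha, df = 0.05, 12                       # n = 14 stores, df = n - 2
t_lower = stats.t.ppf(alpha / 2, df)       # ~ -2.1788, analogue of STUDENT.INV(0.025;12)
t_upper = stats.t.ppf(1 - alpha / 2, df)   # ~ +2.1788
print(t_lower, t_upper)
```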

Applying the F-criterion for the slope. An alternative approach to testing hypotheses about the slope of a simple linear regression is to use the F-criterion. Recall that the F-criterion is used to test the ratio of two variances (see details earlier). When testing the slope hypothesis, the measure of random errors is the error variance (the sum of squared errors divided by the number of degrees of freedom), so the F-test uses the ratio of the variance explained by the regression (i.e. the value SSR divided by the number of independent variables k) to the error variance (MSE = S²YX).

By definition, the F-statistic is equal to the mean squared deviation due to regression (MSR) divided by the error variance (MSE): F = MSR / MSE, where MSR = SSR / k, MSE = SSE / (n - k - 1), and k is the number of independent variables in the regression model. The test statistic F has an F distribution with k and n - k - 1 degrees of freedom.

For a given significance level α, the decision rule is formulated as follows: if F > FU, the null hypothesis is rejected; otherwise it is not rejected. The results, formatted as an analysis-of-variance summary table, are shown in Fig. 20.

Fig. 20. Analysis of variance table for testing the hypothesis about the statistical significance of the regression coefficient

Like the t-criterion, the F-criterion is displayed in the output when using the Analysis Package (Regression option). The full results of the Analysis Package are shown in Fig. 4; the fragment related to the F-statistic is in Fig. 21.

Fig. 21. Results of applying the F-criterion, obtained using the Excel Analysis ToolPak

The F-statistic is 113.23 and the p-value is close to zero (cell Significance F). If the significance level α is 0.05, the critical value of the F-distribution with one and 12 degrees of freedom can be obtained from the formula FU = F.INV(1-0.05; 1; 12) = 4.7472 (Fig. 22). Because F = 113.23 > FU = 4.7472 and the p-value is close to 0 < 0.05, the null hypothesis H0 is rejected, i.e. the size of a store is closely related to its annual sales volume.

Fig. 22. Testing the hypothesis about the slope of the general population at a significance level of 0.05, with one and 12 degrees of freedom
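The critical value of the F-distribution can likewise be obtained with scipy (a sketch):

```python
from scipy import stats

F_U = stats.f.ppf(1 - 0.05, dfn=1, dfd=12)   # ~4.7472, analogue of the Excel formula above
print(F_U)
```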

Confidence interval containing the slope β1. To test the hypothesis that there is a linear relationship between the variables, you can build a confidence interval containing the slope β1 and check whether the hypothetical value β1 = 0 falls inside this interval. The center of the confidence interval containing the slope β1 is the sample slope b1, and its boundaries are the quantities b1 ± t(n-2)·Sb1.

As shown in Fig. 18, b1 = +1.670, n = 14, Sb1 = 0.157. t12 = STUDENT.INV(0.975; 12) = 2.1788. Consequently, b1 ± t(n-2)·Sb1 = +1.670 ± 2.1788 * 0.157 = +1.670 ± 0.342, or +1.328 ≤ β1 ≤ +2.012. Thus, the slope of the population with probability 0.95 lies in the range from +1.328 to +2.012 (i.e. from $1,328,000 to $2,012,000). Because these values are greater than zero, there is a statistically significant linear relationship between annual sales and store area. If the confidence interval contained zero, there would be no relationship between the variables. In addition, the confidence interval means that each additional 1,000 sq. feet of selling space increases average annual sales by between $1,328,000 and $2,012,000.
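A sketch reproducing this confidence interval from the quoted numbers:

```python
from scipy import stats

b1, se_b1, df = 1.670, 0.157, 12
t_crit = stats.t.ppf(0.975, df)                         # ~2.1788
print(b1 - t_crit * se_b1, b1 + t_crit * se_b1)         # ~1.328 and ~2.012
```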

Using the t-criterion for the correlation coefficient. Earlier the correlation coefficient r was introduced as a measure of the relationship between two numeric variables. It can be used to determine whether there is a statistically significant relationship between two variables. Let us denote the correlation coefficient between both variables in the population by the symbol ρ. The null and alternative hypotheses are formulated as follows: H0: ρ = 0 (no correlation), H1: ρ ≠ 0 (there is a correlation). The existence of a correlation is checked with the statistic

t = r / √((1 - r²) / (n - 2)),

where r = +√r², if b1 > 0, and r = -√r², if b1 < 0. The test statistic t has a t distribution with n - 2 degrees of freedom.

In the problem of the Sunflowers store chain, r² = 0.904 and b1 = +1.670 (see Fig. 4). Because b1 > 0, the correlation coefficient between annual sales and store size is r = +√0.904 = +0.951. Let's test the null hypothesis that there is no correlation between these variables using the t-statistic:

t = 0.951 / √((1 - 0.904) / 12).

At a significance level of α = 0.05, the null hypothesis should be rejected because t = 10.64 > 2.1788. Thus, it can be argued that there is a statistically significant relationship between annual sales and store size.

When discussing inferences about population slopes, confidence intervals and criteria for testing hypotheses are interchangeable tools. However, the calculation of the confidence interval containing the correlation coefficient turns out to be more difficult, since the form of the sampling distribution of the statistic r depends on the true correlation coefficient.
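A sketch of the t-test for the correlation coefficient with the numbers of the Sunflowers example (the exact t value depends on the rounding of r):

```python
import numpy as np
from scipy import stats

r, n = 0.951, 14
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)       # ~10.6
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)         # two-sided p-value, ~0
print(t_stat, p_value)
```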

Estimation of mathematical expectation and prediction of individual values

This section discusses methods for estimating the expected response Y and predictions of individual values Y for given values ​​of the variable X.

Construction of a confidence interval. In example 2 (see the section Least squares method above), the regression equation made it possible to predict the value of the variable Y for a given value of X. In the problem of choosing a location for a retail outlet, the average annual sales in a store with an area of 4,000 sq. feet were equal to 7.644 million dollars. However, this estimate of the mathematical expectation of the general population is a point estimate. Earlier, the concept of a confidence interval was proposed to estimate the mathematical expectation of the general population. Similarly, one can introduce the concept of a confidence interval for the mathematical expectation of the response at a set value of the variable X:

(13) Ŷi ± t(n-2) · SYX · √(1/n + (Xi - X̄)² / SSX),

where Ŷi = b0 + b1·Xi is the predicted value of the variable Y at X = Xi, SYX is the mean square error, n is the sample size, Xi is the given value of the variable X, µY|X=Xi is the mathematical expectation of the variable Y at X = Xi, and SSX = Σ(Xi - X̄)².

Analysis of formula (13) shows that the width of the confidence interval depends on several factors. At a given level of significance, an increase in the amplitude of fluctuations around the regression line, measured by the mean square error, leads to an increase in the width of the interval. On the other hand, as expected, an increase in the sample size is accompanied by a narrowing of the interval. In addition, the width of the interval changes depending on the value Xi. If the value of the variable Y is predicted for values of X close to the mean X̄, the confidence interval turns out to be narrower than when predicting the response for values far from the mean.

Let's say that when choosing a location for a store, we want to build a 95% confidence interval for the average annual sales in all stores with an area of 4,000 square feet:

Therefore, the average annual sales volume in all stores with an area of 4,000 square feet lies, with 95% probability, in the range from 6.971 to 8.317 million dollars.

Computing the confidence interval for a predicted value. In addition to the confidence interval for the mathematical expectation of the response at a given value of the variable X, it is often necessary to know the confidence interval for the predicted value itself. Although the formula for such an interval is very similar to formula (13), this interval contains the predicted value rather than an estimate of a parameter. The interval for the predicted response YX=Xi at a specific value Xi is determined by the formula:

Ŷi ± t(n-2) · SYX · √(1 + 1/n + (Xi - X̄)² / SSX).

Let's assume that when choosing a location for a retail outlet, we want to build a 95% confidence interval for the predicted annual sales volume in a store with an area of 4,000 square feet:

Therefore, the predicted annual sales volume for a store of 4,000 sq. feet lies, with 95% probability, in the range from 5.433 to 9.854 million dollars. As you can see, the confidence interval for the predicted response value is much wider than the confidence interval for its mathematical expectation. This is because the variability in predicting individual values is much greater than in estimating the mathematical expectation.
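With statsmodels both intervals can be obtained in one call; a sketch on invented data standing in for the 14 stores (the numbers will therefore differ from those quoted above):

```python
import numpy as np
import statsmodels.api as sm

# invented data: store area (thousands of sq. feet) and annual sales (millions of dollars)
area  = np.array([1.2, 1.6, 2.1, 2.5, 2.9, 3.3, 3.8, 4.2, 4.7, 5.1, 5.5, 5.9])
sales = np.array([2.9, 3.4, 4.4, 5.0, 5.9, 6.3, 7.4, 7.9, 8.8, 9.5, 10.2, 11.0])

model = sm.OLS(sales, sm.add_constant(area)).fit()

new_x = sm.add_constant(np.array([4.0]), has_constant='add')   # a store of 4,000 sq. feet
pred = model.get_prediction(new_x).summary_frame(alpha=0.05)
print(pred[['mean', 'mean_ci_lower', 'mean_ci_upper',   # confidence interval for the mean
            'obs_ci_lower', 'obs_ci_upper']])            # prediction interval (wider)
```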

Pitfalls and ethical issues associated with the use of regression

Difficulties associated with regression analysis:

  • Ignoring the conditions of applicability of the method of least squares.
  • An erroneous estimate of the conditions for applicability of the method of least squares.
  • Wrong choice of alternative methods in violation of the conditions of applicability of the least squares method.
  • Application of regression analysis without in-depth knowledge of the subject of study.
  • Extrapolation of the regression beyond the range of the explanatory variable.
  • Confusion between statistical and causal relationships.

The spread of spreadsheets and software for statistical calculations has eliminated the computational problems that used to prevent the use of regression analysis. However, it has also meant that regression analysis is now used by people without sufficient qualifications and knowledge. How can users know about alternative methods if many of them have not the slightest idea of the conditions of applicability of the least squares method and do not know how to verify that those conditions hold?

The researcher should not get carried away with number crunching - calculating the intercept, the slope and the correlation coefficient. He needs deeper knowledge. Let's illustrate this with a classic textbook example. Anscombe showed that all four datasets shown in Fig. 23 have the same regression parameters (Fig. 24).

Fig. 23. Four artificial data sets

Fig. 24. Regression analysis of the four artificial data sets, performed with the Analysis Package

So, from the point of view of regression analysis, all these data sets are completely identical. If the analysis had ended there, we would have lost a lot of useful information. This is shown by the scatter plots (Fig. 25) and residual plots (Fig. 26) constructed for these data sets.

Fig. 25. Scatter plots for the four datasets

Scatter plots and residual plots show that these data sets differ from one another. The only set distributed along a straight line is set A. The plot of the residuals calculated from set A has no pattern. The same cannot be said for sets B, C and D. The scatter plot for set B shows a pronounced quadratic pattern. This conclusion is confirmed by the residual plot, which has a parabolic shape. The scatter plot and the residual plot show that data set C contains an outlier. In this situation, it is necessary to exclude the outlier from the data set and repeat the analysis. The technique for detecting and eliminating outliers from observations is called influence analysis. After eliminating the outlier, the result of re-estimating the model may be completely different. The scatter plot for data set D illustrates an unusual situation in which the empirical model depends heavily on a single response (X8 = 19, Y8 = 12.5). Such regression models need to be calculated especially carefully. So, scatter and residual plots are an essential tool of regression analysis and should be an integral part of it. Without them, regression analysis is not credible.

Fig. 26. Residual plots for the four datasets

How to avoid pitfalls in regression analysis:

  • Always start the analysis of a possible relationship between the variables X and Y with a scatterplot.
  • Before interpreting the results of a regression analysis, check the conditions for its applicability.
  • Plot the residuals versus the independent variable. This makes it possible to determine how well the empirical model matches the observations and to detect violations of the constancy of the variance.
  • Use histograms, stem and leaf plots, box plots, and normal distribution plots to test the assumption of a normal distribution of errors.
  • If the applicability conditions of the least squares method are not met, use alternative methods (for example, quadratic or multiple regression models).
  • If the applicability conditions of the least squares method are met, it is necessary to test the hypothesis about the statistical significance of the regression coefficients and construct confidence intervals containing the mathematical expectation and the predicted response value.
  • Avoid predicting values ​​of the dependent variable outside the range of the independent variable.
  • Keep in mind that statistical dependencies are not always causal. Remember that correlation between variables does not mean that there is a causal relationship between them.

Summary. As shown in the block diagram (Fig. 27), this note describes the simple linear regression model, the conditions for its applicability, and ways to test those conditions. The t-criterion for testing the statistical significance of the regression slope was considered. A regression model was used to predict the values of the dependent variable. An example related to choosing a location for a retail outlet was considered, in which the dependence of annual sales volume on store area is studied. The information obtained allows you to select a location for a store more precisely and to predict its annual sales volume. In the following notes the discussion of regression analysis will continue, including multiple regression models.

Fig. 27. Block diagram of the note

Materials from the book Levin et al., Statistics for Managers (Moscow: Williams, 2004, pp. 792–872) are used.

If the dependent variable is categorical, logistic regression should be applied.

Using the graphical method.
This method is used to visualize the form of the relationship between the studied economic indicators. To do this, a graph is plotted in a rectangular coordinate system: the individual values of the resultant attribute Y are plotted along the ordinate axis, and the individual values of the factor attribute X along the abscissa.
The set of points of the resultant and factor attributes is called a correlation field.
Based on the correlation field, one can hypothesize (for the general population) that the relationship between all possible values ​​of X and Y is linear.

Linear regression equation has the form y = bx + a + ε
Here ε is the random error (deviation, disturbance).
Reasons for the existence of a random error:
1. Not including significant explanatory variables in the regression model;
2. Aggregation of variables. For example, the total consumption function is an attempt at a general expression of the totality of individual spending decisions of individuals. This is only an approximation of individual relationships that have different parameters.
3. Incorrect description of the model structure;
4. Wrong functional specification;
5. Measurement errors.
Since the deviations ε i for each particular observation i are random and their values ​​in the sample are unknown, then:
1) according to the observations x i and y i, only estimates of the parameters α and β can be obtained
2) The estimates of the parameters α and β of the regression model are, respectively, the values a and b, which are random in nature, since they correspond to a random sample;
Then the estimated regression equation (built from the sample data) has the form y = bx + a + e, where e i are the observed values (estimates) of the errors ε i, and a and b are, respectively, the estimates of the parameters α and β of the regression model that are to be found.
To estimate the parameters α and β - use LSM (least squares).
System of normal equations.

For our data, the system of equations has the form:

9a + 7.23b = 391.9
7.23a + 9.18b = 545.2

Express a from the first equation and substitute it into the second equation.
We get b = 68.16, a = -11.17

Regression Equation:
y = 68.16 x - 11.17

1. Parameters of the regression equation.
Sample means:
x̄ = Σx / n, ȳ = Σy / n.

Sample variances:
S²(x) = Σx² / n - x̄², S²(y) = Σy² / n - ȳ².

Standard deviations:
S(x) = √S²(x), S(y) = √S²(y).

1.1. Correlation coefficient
We calculate the indicator of the closeness of the relationship. Such an indicator is the sample linear correlation coefficient, which is calculated by the formula:

r xy = (Σxy / n - x̄·ȳ) / (S(x)·S(y)).

The linear correlation coefficient takes values from -1 to +1.
Relationships between attributes can be weak or strong (close). Their strength is rated on the Chaddock scale:
0.1 < r xy < 0.3: weak;
0.3 < r xy < 0.5: moderate;
0.5 < r xy < 0.7: noticeable;
0.7 < r xy < 0.9: high;
0.9 < r xy < 1: very high.
In our example, the relationship between the attribute Y and the factor X is very high and direct.

1.2. Regression equation (estimation of the regression equation).

The linear regression equation is y = 68.16 x - 11.17.
The coefficients of a linear regression equation can be given an economic meaning. The regression equation coefficient shows by how many units the result will change when the factor changes by 1 unit.
The coefficient b = 68.16 shows the average change in the effective indicator (in units of y) with an increase or decrease in the value of the factor x per unit of its measurement. In this example, with an increase in x of 1 unit, y increases on average by 68.16.
The coefficient a = -11.17 formally shows the predicted level of y, but only if x = 0 is close to the sample values.
But if x = 0 is far from the sample values of x, then a literal interpretation can lead to incorrect results, and even if the regression line describes the values of the observed sample fairly accurately, there is no guarantee that this will also be the case when extrapolating to the left or to the right.
By substituting the corresponding values of x into the regression equation, we can determine the aligned (predicted) values of the effective indicator y(x) for each observation.
The sign of the regression coefficient b determines the direction of the relationship between y and x (if b > 0, the relationship is direct; otherwise it is inverse). In our example, the relationship is direct.

1.3. Elasticity coefficient.
It is undesirable to use regression coefficients (in the example, b) for a direct assessment of the influence of factors on the effective attribute when the units of measurement of the effective indicator y and the factor attribute x differ.
For these purposes, elasticity coefficients and beta coefficients are calculated. The elasticity coefficient is found by the formula:

E = b · x̄ / ȳ.

It shows by how many percent the effective attribute y changes on average when the factor attribute x changes by 1%. It does not take into account the degree of fluctuation of the factors.
In our example, the elasticity coefficient is greater than 1. Therefore, if X changes by 1%, Y will change by more than 1%. In other words, X significantly affects Y.
The beta coefficient shows by what part of its standard deviation the resulting attribute will change on average when the factor attribute changes by one of its standard deviations, with the remaining independent variables fixed at a constant level:

β = b · S(x) / S(y).

That is, an increase in x by one standard deviation of this indicator will lead to an increase in the average Y by 0.9796 of the standard deviation of that indicator.

1.4. Approximation error.
Let us evaluate the quality of the regression equation using the mean absolute approximation error:

A = (1/n) · Σ|y - y(x)| / y · 100%.

Since the error is greater than 15%, it is not desirable to use this equation as a regression.

1.6. Determination coefficient.
The square of the (multiple) correlation coefficient is called the coefficient of determination, which shows the proportion of the variation of the resultant attribute explained by the variation of the factor attribute.
Most often, when interpreting the coefficient of determination, it is expressed as a percentage.
R² = 0.98² = 0.9596,
i.e. 95.96% of the variation in y is explained by the variation in x. In other words, the accuracy of the fit of the regression equation is high. The remaining 4.04% of the variation in Y is due to factors not taken into account in the model.

x   y   x²   y²   x·y   y(x)   (yi - ȳ)²   (y - y(x))²   (xi - x̄)²   |y - y(x)| : y
0.371 15.6 0.1376 243.36 5.79 14.11 780.89 2.21 0.1864 0.0953
0.399 19.9 0.1592 396.01 7.94 16.02 559.06 15.04 0.163 0.1949
0.502 22.7 0.252 515.29 11.4 23.04 434.49 0.1176 0.0905 0.0151
0.572 34.2 0.3272 1169.64 19.56 27.81 87.32 40.78 0.0533 0.1867
0.607 44.5 .3684 1980.25 27.01 30.2 0.9131 204.49 0.0383 0.3214
0.655 26.8 0.429 718.24 17.55 33.47 280.38 44.51 0.0218 0.2489
0.763 35.7 0.5822 1274.49 27.24 40.83 61.54 26.35 0.0016 0.1438
0.873 30.6 0.7621 936.36 26.71 48.33 167.56 314.39 0.0049 0.5794
2.48 161.9 6.17 26211.61 402 158.07 14008.04 14.66 2.82 0.0236
7.23 391.9 9.18 33445.25 545.2 391.9 16380.18 662.54 3.38 1.81
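The calculations of this section can be checked programmatically; a sketch using the x and y values from the table above (because of rounding, the results only approximately match the quoted figures):

```python
import numpy as np

x = np.array([0.371, 0.399, 0.502, 0.572, 0.607, 0.655, 0.763, 0.873, 2.48])
y = np.array([15.6, 19.9, 22.7, 34.2, 44.5, 26.8, 35.7, 30.6, 161.9])
n = len(x)

b = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
a = y.mean() - b * x.mean()                        # roughly 68.2 and -11.2

r = np.corrcoef(x, y)[0, 1]                        # correlation coefficient, ~0.98
elasticity = b * x.mean() / y.mean()               # elasticity coefficient, > 1
beta = b * x.std(ddof=1) / y.std(ddof=1)           # beta coefficient, ~0.98
approx_err = np.mean(np.abs(y - (a + b * x)) / y)  # mean approximation error, ~0.20
print(b, a, r, r ** 2, elasticity, beta, approx_err)
```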

2. Estimation of the parameters of the regression equation.
2.1. Significance of the correlation coefficient.

We test the significance of the correlation coefficient with the statistic t obs = r·√(n - m - 1) / √(1 - r²).

Using Student's table with significance level α = 0.05 and degrees of freedom k = 7 we find t crit:
t crit = t(7; 0.05) = 1.895,
where m = 1 is the number of explanatory variables.
If t obs > t crit, the obtained value of the correlation coefficient is recognized as significant (the null hypothesis asserting that the correlation coefficient is equal to zero is rejected).
Since t obs > t crit, we reject the hypothesis that the correlation coefficient is equal to 0. In other words, the correlation coefficient is statistically significant.
In paired linear regression t²r = t²b, so testing the hypotheses about the significance of the regression and correlation coefficients is equivalent to testing the hypothesis about the significance of the linear regression equation.

2.3. Analysis of the accuracy of determining estimates of regression coefficients.
An unbiased estimate of the variance of the disturbances is the value:

S²y = Σ(y - y(x))² / (n - m - 1) = 662.54 / 7 = 94.6484 - unexplained variance (a measure of the dispersion of the dependent variable around the regression line).

S y = 9.7287 - standard error of the estimate (standard error of the regression).

S a is the standard deviation of the random variable a:

S a = S y · √(Σx² / (n · Σ(x - x̄)²)) = 5.3429.

S b is the standard deviation of the random variable b:

S b = S y / √(Σ(x - x̄)²) = 5.2894.

2.4. Confidence intervals for the dependent variable.
Economic forecasting based on the constructed model assumes that the pre-existing relationships of variables are preserved for the lead period as well.
To predict the dependent variable (the resultant attribute), it is necessary to know the predictive values of all the factors included in the model.
The predictive values of the factors are substituted into the model and point predictive estimates of the indicator under study are obtained: (a + bx p ± ε),
where ε = t crit · S y · √(1/n + (x p - x̄)² / Σ(x - x̄)²).

Let's calculate the boundaries of the interval in which 95% of the possible values of Y will be concentrated for an unlimited number of observations and X p = 1: (-11.17 + 68.16·1 ± 6.4554)
(50.53; 63.44)

Individual confidence intervals for Y at a given value of X:
(a + bx i ± ε),
where ε = t crit · S y · √(1 + 1/n + (x i - x̄)² / Σ(x - x̄)²).
x i   y = -11.17 + 68.16·x i   ε i   y min   y max
0.371 14.11 19.91 -5.8 34.02
0.399 16.02 19.85 -3.83 35.87
0.502 23.04 19.67 3.38 42.71
0.572 27.81 19.57 8.24 47.38
0.607 30.2 19.53 10.67 49.73
0.655 33.47 19.49 13.98 52.96
0.763 40.83 19.44 21.4 60.27
0.873 48.33 19.45 28.88 67.78
2.48 158.07 25.72 132.36 183.79

With a probability of 95%, it can be guaranteed that the value of Y with an unlimited number of observations will not go beyond the limits of the found intervals.
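The individual intervals in the table above can be reproduced in the same way; a sketch under the same assumptions (the extra unit term inside the square root accounts for the individual, rather than mean, prediction):

```python
import numpy as np

x = np.array([0.371, 0.399, 0.502, 0.572, 0.607, 0.655, 0.763, 0.873, 2.48])
a, b, S_y, t_crit, n = -11.17, 68.16, 9.7287, 1.895, 9

Sxx = ((x - x.mean())**2).sum()
for xi in x:
    y_hat = a + b * xi
    eps = t_crit * S_y * np.sqrt(1 + 1/n + (xi - x.mean())**2 / Sxx)
    print(round(xi, 3), round(y_hat, 2), round(y_hat - eps, 2), round(y_hat + eps, 2))
```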

2.5. Testing hypotheses regarding the coefficients of the linear regression equation.
1) t-statistics. Student's criterion.
Let us test the hypothesis H_0 that an individual regression coefficient equals zero (against the alternative H_1 that it does not) at the significance level α = 0.05.
t_crit(7; 0.05) = 1.895


t_b = b / S_b = 68.1618 / 5.2894 = 12.8866. Since 12.8866 > 1.895, the statistical significance of the regression coefficient b is confirmed (we reject the hypothesis that this coefficient equals zero).


t_a = |a| / S_a = 11.1744 / 5.3429 = 2.0914. Since 2.0914 > 1.895, the statistical significance of the regression coefficient a is confirmed (we reject the hypothesis that this coefficient equals zero).
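A compact sketch of both checks, using the coefficients and standard errors obtained above (scipy only supplies the table value):

```python
from scipy import stats

b, S_b = 68.1618, 5.2894
a, S_a = -11.1744, 5.3429
t_crit = stats.t.ppf(1 - 0.05, df=7)   # ~1.895, as in the example

t_b = abs(b) / S_b   # ~12.89 -> b is significant
t_a = abs(a) / S_a   # ~2.09  -> a is significant
print(t_b, t_a, t_crit)
```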

Confidence interval for the coefficients of the regression equation.
Let us determine the confidence intervals of the regression coefficients, which, with 95% reliability, will be as follows:
(b - t_crit·S_b; b + t_crit·S_b)
(68.1618 - 1.895·5.2894; 68.1618 + 1.895·5.2894)
(58.1385; 78.1852)
With a probability of 95%, it can be argued that the value of this parameter will lie in the found interval.
(a - t_crit·S_a; a + t_crit·S_a)
(-11.1744 - 1.895·5.3429; -11.1744 + 1.895·5.3429)
(-21.2992; -1.0496)
With a probability of 95%, it can be argued that the value of this parameter will lie in the found interval.
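The same intervals in code, as plain arithmetic on the values above:

```python
t_crit = 1.895
b, S_b = 68.1618, 5.2894
a, S_a = -11.1744, 5.3429

print(b - t_crit * S_b, b + t_crit * S_b)   # ~(58.14, 78.19)
print(a - t_crit * S_a, a + t_crit * S_a)   # ~(-21.30, -1.05)
```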

2) F-statistics. Fisher's criterion.
The significance of the regression model is checked using Fisher's F-test, whose calculated value is found as the ratio of the variance explained by the regression to the unbiased estimate of the variance of the residuals for this model.
If the calculated value with (m; n - m - 1) degrees of freedom is greater than the tabulated value at the given significance level, the model is considered significant.

where m is the number of factors in the model.
The assessment of the statistical significance of paired linear regression is carried out according to the following algorithm:
1. A null hypothesis is put forward that the equation as a whole is statistically insignificant: H_0: R² = 0, at significance level α.
2. Next, determine the actual value of the F-criterion:
F = R² / (1 - R²) · (n - m - 1) / m,
where m=1 for pairwise regression.
3. The table value is determined from Fisher distribution tables for the given significance level, taking into account that the number of degrees of freedom for the factor (explained) sum of squares (the larger variance) is 1, and the number of degrees of freedom for the residual sum of squares (the smaller variance) in linear regression is n - 2.
4. If the actual value of the F-criterion is less than the table value, then they say that there is no reason to reject the null hypothesis.
Otherwise, the null hypothesis is rejected and the alternative hypothesis about the statistical significance of the equation as a whole is accepted with probability (1-α).
Table value of the criterion with degrees of freedom k1=1 and k2=7, Fkp = 5.59
Since the actual value of F > Fkp, the coefficient of determination is statistically significant (The found estimate of the regression equation is statistically reliable).
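A sketch of this check, using the F-statistic formula given above and scipy only for the table value:

```python
from scipy import stats

R2, n, m = 0.9596, 9, 1
F = R2 / (1 - R2) * (n - m - 1) / m                    # ~166
F_crit = stats.f.ppf(1 - 0.05, dfn=m, dfd=n - m - 1)   # ~5.59
print(F, F_crit, F > F_crit)                           # the equation as a whole is significant
```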

Check for Autocorrelation of Residuals.
An important prerequisite for constructing a qualitative regression model by OLS (the least squares method) is that the random deviations be independent of the deviations in all other observations. This ensures that there is no correlation between any deviations and, in particular, between adjacent deviations.
Autocorrelation (serial correlation) is defined as the correlation between observations ordered in time (time series) or in space (cross-sectional series). Autocorrelation of residuals (deviations) is commonly encountered in regression analysis when using time-series data and very rarely when using cross-sectional data.
In economic problems, positive autocorrelation is much more common than negative autocorrelation. In most cases, positive autocorrelation is caused by the persistent, directional influence of some factors not taken into account in the model.
Negative autocorrelation effectively means that a positive deviation is followed by a negative one and vice versa. Such a situation can occur if, for example, the relationship between the demand for cold drinks and income is examined using seasonal (winter-summer) data.
Among the main causes of autocorrelation, the following can be distinguished:
1. Specification errors. Failure to include an important explanatory variable in the model, or a wrong choice of the form of the dependence, usually leads to systematic deviations of the observation points from the regression line, which can give rise to autocorrelation.
2. Inertia. Many economic indicators (inflation, unemployment, GNP, etc.) have a certain cyclicality associated with the waves of business activity. Therefore, indicators do not change instantly but with a certain inertia.
3. Cobweb effect. In many industrial and other areas, economic indicators react to changes in economic conditions with a delay (time lag).
4. Data smoothing. Often, data for a long time period are obtained by averaging the data over its constituent sub-intervals. This can smooth out fluctuations that existed within the period under consideration, which in turn can cause autocorrelation.
The consequences of autocorrelation are similar to those of heteroscedasticity: conclusions based on the t- and F-statistics that determine the significance of the regression coefficients and of the coefficient of determination may be incorrect.

Autocorrelation detection

1. Graphic method
There are a number of options for detecting autocorrelation graphically. One of them plots the deviations e_i against the moments i at which they were obtained: either the time of obtaining the statistical data or the serial number of the observation is plotted along the abscissa, and the deviations e_i (or their estimates) along the ordinate.
It is natural to assume that if there is a certain relationship between deviations, then autocorrelation takes place. The absence of dependence will most likely indicate the absence of autocorrelation.
Autocorrelation becomes clearer if e_i is plotted against e_{i-1}.
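A minimal plotting sketch of the two diagnostic pictures described above (matplotlib is assumed to be available; the residuals e_i are the ones computed in the Durbin-Watson table below):

```python
import matplotlib.pyplot as plt

# Residuals e_i = y - y(x) from the worked example
e = [1.49, 3.88, -0.3429, 6.39, 14.3, -6.67, -5.13, -17.73, 3.83]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.scatter(range(1, len(e) + 1), e)   # e_i against the observation number i
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("i"); ax1.set_ylabel("e_i")

ax2.scatter(e[:-1], e[1:])             # e_i against e_{i-1}
ax2.set_xlabel("e_{i-1}"); ax2.set_ylabel("e_i")

plt.show()
```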

Durbin-Watson test.
This criterion is the best known for detecting autocorrelation.
At the initial stage of the statistical analysis of a regression equation, one premise is often checked: that the deviations are statistically independent of each other. In this case, the uncorrelatedness of neighboring values e_i is checked.

   y   |  y(x)  | e_i = y - y(x) |  e_i²   | (e_i - e_{i-1})²
  15.6 |  14.11 |      1.49      |   2.21  |      0
  19.9 |  16.02 |      3.88      |  15.04  |      5.72
  22.7 |  23.04 |     -0.3429    |   0.1176|     17.81
  34.2 |  27.81 |      6.39      |  40.78  |     45.28
  44.5 |  30.2  |     14.3       | 204.49  |     62.64
  26.8 |  33.47 |     -6.67      |  44.51  |    439.82
  35.7 |  40.83 |     -5.13      |  26.35  |      2.37
  30.6 |  48.33 |    -17.73      | 314.39  |    158.7
 161.9 | 158.07 |      3.83      |  14.66  |    464.81
   Σ   |        |                | 662.54  |   1197.14

To analyze the correlation of the deviations, the Durbin-Watson statistic is used:
DW = Σ(e_i - e_{i-1})² / Σ e_i².
Critical values ​​d 1 and d 2 are determined on the basis of special tables for the required significance level α, the number of observations n = 9 and the number of explanatory variables m=1.
There is no autocorrelation of the residuals if the following condition holds:
d_2 < DW < 4 - d_2.
Without referring to the tables, we can use the approximate rule and assume that there is no autocorrelation of the residuals if 1.5 < DW < 2.5. For a more reliable conclusion, it is advisable to refer to the table values.
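A sketch of the computation, using the residuals from the table above and the DW formula given earlier:

```python
import numpy as np

# Residuals e_i = y - y(x) from the worked example
e = np.array([1.49, 3.88, -0.3429, 6.39, 14.3, -6.67, -5.13, -17.73, 3.83])

DW = (np.diff(e)**2).sum() / (e**2).sum()
print(DW)   # ~1.81, inside the rough 1.5 < DW < 2.5 band -> no sign of autocorrelation
```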