Use the least squares method to find a straight line. Where is the method of least squares applied?

Introduction

I am a programmer. The biggest leap in my career came when I learned to say: "I don't understand anything!" Now I am not ashamed to tell a luminary of science who is lecturing me that I do not understand what he, the luminary, is talking about. And that is very difficult. Yes, it is hard and embarrassing to admit that you do not know something. Who likes to admit that he does not know the basics of some subject? My profession requires me to attend a great many presentations and lectures where, I confess, in the vast majority of cases I want to sleep because I understand nothing. And I do not understand because the huge problem with the current state of science lies in mathematics: it assumes that every listener is familiar with absolutely every area of mathematics (which is absurd). Admitting that you do not know what a derivative is (what it is will come a little later) is considered shameful.

But I have learned to say that I do not know what multiplication is. Yes, I do not know what a subalgebra over a Lie algebra is. Yes, I do not know why quadratic equations are needed in life. By the way, if you are sure that you do know, then we have something to talk about! Mathematics is a series of tricks. Mathematicians try to confuse and intimidate the public; where there is no confusion, there is no reputation and no authority. Yes, it is prestigious to speak in the most abstract language possible, which is complete nonsense in itself.

Do you know what a derivative is? Most likely you will tell me about the limit of the difference quotient. In my first year of mathematics at St. Petersburg State University, Viktor Petrovich Khavin defined the derivative for me as the coefficient of the first term of the Taylor series of the function at a point (defining the Taylor series without derivatives was a separate bit of gymnastics). I laughed at this definition for a long time, until I finally understood what it was about. The derivative is nothing more than a measure of how similar the function we are differentiating is to the functions y=x, y=x^2, y=x^3.

I now have the honor of lecturing students who are afraid of mathematics. If you are afraid of mathematics, we are on the same path. As soon as you try to read some text and it seems overly complicated, know that it is badly written. I claim that there is not a single area of mathematics that cannot be discussed "on the fingers" without losing accuracy.

The challenge for the near future: I asked my students to work out what a linear-quadratic regulator is. Don't be shy, spend three minutes of your life and look it up. If you do not understand anything, then we are on the same path. I (a professional mathematician and programmer) did not understand anything either. And I assure you, it can be sorted out "on the fingers". At the moment I do not know what it is, but I assure you that we will be able to figure it out.

So, the first lecture I am going to give my students, after they come running to me in horror saying that the linear-quadratic regulator is a terrible beast that you will never master in your life, is on least squares methods. Can you solve linear equations? If you are reading this text, then most likely not.

So, given two points (x0, y0), (x1, y1), for example, (1,1) and (3,2), the task is to find the equation of a straight line passing through these two points:

illustration

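This straight line should have an equation like the following:

$$y = \alpha x + \beta$$

Here alpha and beta are unknown to us, but two points of this line are known:

$$\begin{cases} \alpha x_0 + \beta = y_0 \\ \alpha x_1 + \beta = y_1 \end{cases}$$

You can write this system in matrix form:

$$\begin{pmatrix} x_0 & 1 \\ x_1 & 1 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \end{pmatrix}$$

Here we should make a lyrical digression: what is a matrix? A matrix is nothing but a two-dimensional array. It is a way of storing data, and no further meaning should be attached to it. It is up to us how exactly to interpret a given matrix. Sometimes I will interpret it as a linear mapping, sometimes as a quadratic form, and sometimes just as a set of vectors. This will all be clarified in context.

Let's replace the specific matrices with their symbolic representation, where the vector x stands for (alpha, beta):

$$A x = b$$

Then (alpha, beta) can be easily found:

$$x = A^{-1} b$$

More specifically, for our previous data:

$$\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 3 & 1 \end{pmatrix}^{-1} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} -1/2 & 1/2 \\ 3/2 & -1/2 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}$$

Which leads to the following equation of a straight line passing through the points (1,1) and (3,2):

$$y = \frac{1}{2} x + \frac{1}{2}$$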

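Okay, everything is clear here. Now let's find the equation of a straight line passing through three points: (x0,y0), (x1,y1) and (x2,y2):

$$\begin{cases} \alpha x_0 + \beta = y_0 \\ \alpha x_1 + \beta = y_1 \\ \alpha x_2 + \beta = y_2 \end{cases}$$

Oh-oh-oh, but we have three equations for two unknowns! The standard mathematician will say that there is no solution. What will the programmer say? He will first rewrite the previous system of equations in the following form:

$$\alpha \begin{pmatrix} x_0 \\ x_1 \\ x_2 \end{pmatrix} + \beta \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ y_2 \end{pmatrix}, \quad \text{i.e.} \quad \alpha \vec{i} + \beta \vec{j} = \vec{b}$$

In our case the vectors i, j, b are three-dimensional, hence (in the general case) there is no solution to this system. Any vector (alpha*i + beta*j) lies in the plane spanned by the vectors (i, j). If b does not belong to this plane, then there is no solution (equality in the equation cannot be achieved). What to do? Let's look for a compromise. Let's denote by e(alpha, beta) how far we are from achieving equality:

$$\vec{e}(\alpha, \beta) = \alpha \vec{i} + \beta \vec{j} - \vec{b}$$

And we will try to minimize this error:

$$\min_{\alpha, \beta} \|\vec{e}(\alpha, \beta)\|^2$$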

Why a square?

We are looking not just for the minimum of the norm, but for the minimum of the square of the norm. Why? The minimum point itself coincides, and the square gives a smooth function (a quadratic function of the arguments (alpha,beta)), while just the length gives a function in the form of a cone, non-differentiable at the minimum point. Brr. Square is more convenient.

Obviously, the error is minimized when the vector e is orthogonal to the plane spanned by the vectors i and j.

Illustration

In other words: we are looking for a line such that the sum of the squared distances from all points to this line is minimal:

UPDATE: I made a mistake here: the distance to the line should be measured vertically, not by orthogonal projection. The commenter who pointed this out is correct.

Illustration

In completely different words (careful, this is poorly formalized, but it should be clear on the fingers): we take all possible lines through all pairs of points and look for the average line among them:

Illustration

Another explanation on the fingers: we attach a spring between each data point (here we have three) and the straight line we are looking for, and the equilibrium position of the line is exactly what we are looking for.

Quadratic form minimum

So, given the vector b and the plane spanned by the column vectors of the matrix A (in this case (x0,x1,x2) and (1,1,1)), we are looking for the vector e with the minimum squared length. Obviously, the minimum is achieved only for a vector e orthogonal to the plane spanned by the column vectors of the matrix A:

$$A^T e = A^T (A x - b) = 0$$

In other words, we are looking for a vector x=(alpha, beta) such that:

$$A^T A x = A^T b, \quad \text{that is} \quad x = (A^T A)^{-1} A^T b$$

I remind you that this vector x=(alpha, beta) is the minimum of the quadratic function ||e(alpha, beta)||^2:

$$\|e(\alpha, \beta)\|^2 = (A x - b)^T (A x - b) = x^T A^T A x - 2 b^T A x + b^T b$$

Here it is useful to remember that a matrix can also be interpreted as a quadratic form; for example, the identity matrix ((1,0),(0,1)) can be interpreted as the function x^2 + y^2:

$$\begin{pmatrix} x & y \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = x^2 + y^2$$

All this gymnastics is known as linear regression.
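The recipe above (build A with rows (x_k, 1), then solve A^T A x = A^T b) fits in a few lines of code. The following is only a minimal sketch: the names `Line` and `fit_line` are made up for this illustration, and the 2x2 system is solved by Cramer's rule rather than by a real linear solver.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Coefficients of the fitted line y = alpha * x + beta.
struct Line {
    double alpha;
    double beta;
};

// Least-squares line through the points (xs[k], ys[k]).
// The rows of A are (x_k, 1) and b holds the y_k, so
// A^T A = [[sum x^2, sum x], [sum x, n]] and A^T b = [sum x*y, sum y].
Line fit_line(const std::vector<double>& xs, const std::vector<double>& ys) {
    assert(xs.size() == ys.size() && xs.size() >= 2);
    const double n = static_cast<double>(xs.size());
    double sxx = 0.0, sx = 0.0, sxy = 0.0, sy = 0.0;
    for (std::size_t k = 0; k < xs.size(); ++k) {
        sxx += xs[k] * xs[k];
        sx += xs[k];
        sxy += xs[k] * ys[k];
        sy += ys[k];
    }
    // Solve the 2x2 system (A^T A) x = A^T b by Cramer's rule.
    const double det = sxx * n - sx * sx;
    assert(std::fabs(det) > 1e-12);  // det is zero when all x_k coincide
    Line line;
    line.alpha = (sxy * n - sx * sy) / det;
    line.beta = (sxx * sy - sx * sxy) / det;
    return line;
}
```

For the two points (1,1) and (3,2) this returns alpha = beta = 0.5, i.e. the line that passes exactly through both points; with three or more points it returns the compromise line with the smallest sum of squared vertical errors.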

Laplace equation with Dirichlet boundary condition

Now the simplest real problem: there is a certain triangulated surface, it is necessary to smooth it. For example, let's load my face model:

The original commit is available. To minimize external dependencies, I took the code of my software renderer, which has already been described on Habr. To solve the linear system I use OpenNL. It is a great solver, but it is really hard to install: you need to copy two files (.h+.c) into your project folder. All smoothing is done by the following code:

    for (int d=0; d<3; d++) {  // solve separately for x, y and z
        nlNewContext();
        nlSolverParameteri(NL_NB_VARIABLES, verts.size());
        nlSolverParameteri(NL_LEAST_SQUARES, NL_TRUE);
        nlBegin(NL_SYSTEM);
        nlBegin(NL_MATRIX);

        // one row per vertex: stay close to the original position
        for (int i=0; i<(int)verts.size(); i++) {
            nlBegin(NL_ROW);
            nlCoefficient(i, 1);
            nlRightHandSide(verts[i][d]);
            nlEnd(NL_ROW);
        }

        // one row per triangle edge: the two endpoints should coincide
        for (unsigned int i=0; i<faces.size(); i++) {
            std::vector<int> &face = faces[i];
            for (int j=0; j<3; j++) {
                nlBegin(NL_ROW);
                nlCoefficient(face[ j     ],  1);
                nlCoefficient(face[(j+1)%3], -1);
                nlEnd(NL_ROW);
            }
        }

        nlEnd(NL_MATRIX);
        nlEnd(NL_SYSTEM);
        nlSolve();

        // copy the solution back into the model
        for (int i=0; i<(int)verts.size(); i++) {
            verts[i][d] = nlGetVariable(i);
        }
    }

The X, Y and Z coordinates are separable, so I smooth them separately. That is, I solve three systems of linear equations, each with as many variables as there are vertices in my model. The first n rows of matrix A have a single 1 per row, and the first n entries of vector b hold the original model coordinates. That is, I attach a spring between the new vertex position and the old vertex position: the new ones should not move too far from the old ones.

All subsequent rows of matrix A (faces.size()*3 = the number of edges of all triangles in the mesh) have one occurrence of 1 and one occurrence of -1, while the corresponding components of vector b are zero. This means I put a spring on each edge of our triangular mesh: every edge tries to make its start and end vertices coincide.

Once again: all vertices are variables, and they cannot deviate far from their original position, but at the same time they try to become similar to each other.

Here is the result:

Everything would be fine, the model really is smoothed, but its boundary has moved away from the original boundary. Let's change the code a little:

    for (int i=0; i<(int)verts.size(); i++) {
        float scale = border[i] ? 1000 : 1;  // boundary vertices get a much stiffer spring
        nlBegin(NL_ROW);
        nlCoefficient(i, scale);
        nlRightHandSide(scale*verts[i][d]);
        nlEnd(NL_ROW);
    }

In our matrix A, for the vertices that lie on the boundary, I add not a row of the form v_i = verts[i][d], but 1000*v_i = 1000*verts[i][d]. What does this change? It changes our quadratic form of the error. Now a unit deviation of a boundary vertex costs not one unit, as before, but 1000*1000 units. That is, we have hung a stronger spring on the boundary vertices, and the solution prefers to stretch the other springs more. Here is the result:

Let's double the strength of the springs between the vertices:

    nlCoefficient(face[ j     ],  2);
    nlCoefficient(face[(j+1)%3], -2);

It is logical that the surface has become smoother:

And now even a hundred times stronger:

What is this? Imagine that we have dipped a wire ring in soapy water. The resulting soap film will try to have the least possible curvature while still touching the border, our wire ring. This is exactly what we got by fixing the boundary and asking for a smooth surface inside. Congratulations, we have just solved the Laplace equation with Dirichlet boundary conditions. Sounds cool? But in fact it is just one system of linear equations to solve.
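To play with the same idea without a mesh or OpenNL, here is a self-contained one-dimensional sketch: a polyline of heights instead of a triangulated surface. Everything in it is an assumption made for illustration: `LeastSquares`, `add_row` and `smooth` are hypothetical names, and the tiny dense solver (normal equations plus Gaussian elimination) merely stands in for a real sparse solver.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// One sparse row of A: pairs (column index, coefficient).
typedef std::vector<std::pair<int, double> > Row;

// Accumulates the normal equations (A^T A) x = A^T b row by row.
struct LeastSquares {
    std::vector<std::vector<double> > N;  // A^T A
    std::vector<double> rhs;              // A^T b

    explicit LeastSquares(int n) : N(n, std::vector<double>(n, 0.0)), rhs(n, 0.0) {}

    // Add the equation sum_j c_j * x_j = b to the system.
    void add_row(const Row& row, double b) {
        for (std::size_t p = 0; p < row.size(); ++p) {
            rhs[row[p].first] += row[p].second * b;
            for (std::size_t q = 0; q < row.size(); ++q)
                N[row[p].first][row[q].first] += row[p].second * row[q].second;
        }
    }

    // Gaussian elimination with partial pivoting, then back substitution.
    std::vector<double> solve() {
        const int n = static_cast<int>(rhs.size());
        for (int c = 0; c < n; ++c) {
            int p = c;
            for (int r = c + 1; r < n; ++r)
                if (std::fabs(N[r][c]) > std::fabs(N[p][c])) p = r;
            assert(std::fabs(N[p][c]) > 1e-12);
            std::swap(N[c], N[p]);
            std::swap(rhs[c], rhs[p]);
            for (int r = c + 1; r < n; ++r) {
                const double f = N[r][c] / N[c][c];
                for (int k = c; k < n; ++k) N[r][k] -= f * N[c][k];
                rhs[r] -= f * rhs[c];
            }
        }
        std::vector<double> x(n, 0.0);
        for (int r = n - 1; r >= 0; --r) {
            double s = rhs[r];
            for (int k = r + 1; k < n; ++k) s -= N[r][k] * x[k];
            x[r] = s / N[r][r];
        }
        return x;
    }
};

// Data rows keep each height near its old value (the two end points are
// weighted by border_weight); edge rows pull neighbouring heights together.
std::vector<double> smooth(const std::vector<double>& v, double border_weight,
                           double edge_weight) {
    const int n = static_cast<int>(v.size());
    LeastSquares ls(n);
    for (int i = 0; i < n; ++i) {
        const double w = (i == 0 || i == n - 1) ? border_weight : 1.0;
        ls.add_row(Row(1, std::make_pair(i, w)), w * v[i]);
    }
    for (int i = 0; i + 1 < n; ++i) {
        Row edge;
        edge.push_back(std::make_pair(i, edge_weight));
        edge.push_back(std::make_pair(i + 1, -edge_weight));
        ls.add_row(edge, 0.0);
    }
    return ls.solve();
}
```

With `edge_weight` set to 0 the heights come back unchanged. With stiff borders and stiff edges, for example `smooth(v, 1000, 100)`, the interior is pulled onto a nearly straight segment between the two end points, which is the one-dimensional analogue of the soap film.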

Poisson equation

Let's have another cool name.

Let's say I have an image like this:

Everything is fine, but I don't like the chair.

I cut the picture in half:

And I select the chair by hand:

Then I pull everything that is white in the mask towards the left part of the picture, and at the same time I require, over the whole picture, that the difference between two neighboring pixels be equal to the difference between the corresponding two neighboring pixels of the right image:

for (int i=0; i
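One possible way to write down the problem described above, as a sketch only: let f be the unknown image, a the left image, b the right image and M the set of pixels that are white in the mask (all four symbols are notation introduced here, not taken from the original code). The least squares problem then reads roughly:

```latex
\min_{f} \; \sum_{p \in M} (f_p - a_p)^2
\; + \sum_{(p, q)\ \text{neighbours}} \bigl( (f_p - f_q) - (b_p - b_q) \bigr)^2
```

Each term is one row of the system: a row with a single 1 for every masked pixel, and a row with one 1 and one -1 for every pair of neighbouring pixels, just like the springs in the smoothing example, except that the right-hand side of the edge rows is no longer zero. The relative weight of the two sums is a free choice.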

Here is the result:

Code and pictures are available

Linear regression is widely used in econometrics because its parameters have a clear economic interpretation.

Linear regression is reduced to finding an equation of the form

$$\hat{y}_x = a + b x$$

or

$$y = a + b x + \varepsilon$$

An equation of the form $\hat{y}_x = a + b x$ makes it possible, for given values of the factor x, to obtain theoretical values of the resulting feature by substituting the actual values of the factor x into it.

Building a linear regression comes down to estimating its parameters a and b. Estimates of the linear regression parameters can be found by different methods.

The classical approach to estimating linear regression parameters is based on the method of least squares (LSM).

LSM yields estimates of the parameters a and b for which the sum of the squared deviations of the actual values of the resulting feature y from the calculated (theoretical) values $\hat{y}_x$ is minimal:

$$\sum_i (y_i - \hat{y}_{x_i})^2 \to \min$$

To find the minimum of this function, calculate its partial derivatives with respect to each of the parameters a and b and set them equal to zero.

Denote the sum of squared deviations by S; then:

$$S = \sum (y - a - b x)^2, \qquad \frac{\partial S}{\partial a} = -2 \sum (y - a - b x) = 0, \qquad \frac{\partial S}{\partial b} = -2 \sum x (y - a - b x) = 0$$

Transforming these formulas, we obtain the following system of normal equations for estimating the parameters a and b:

$$\begin{cases} n a + b \sum x = \sum y \\ a \sum x + b \sum x^2 = \sum x y \end{cases} \qquad (3.5)$$

Solving the system of normal equations (3.5) either by successive elimination of variables or by the method of determinants, we find the desired estimates of the parameters a and b.

Parameter b is called the regression coefficient. Its value shows the average change in the result when the factor changes by one unit.

The regression equation is always supplemented with an indicator of the tightness of the relationship. For linear regression, the linear correlation coefficient $r_{xy}$ serves as such an indicator. There are various forms of the linear correlation coefficient formula. Some of them are listed below:

$$r_{xy} = b \frac{\sigma_x}{\sigma_y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\overline{xy} - \bar{x} \bar{y}}{\sigma_x \sigma_y}$$

As you know, the linear correlation coefficient lies within the limits $-1 \le r_{xy} \le 1$.

To assess how well the linear function fits, the square of the linear correlation coefficient $r_{xy}^2$, called the coefficient of determination, is calculated. The coefficient of determination characterizes the proportion of the variance of the resulting feature y explained by the regression in the total variance of the resulting feature:

$$r_{xy}^2 = \frac{\sigma_{y\,\text{explained}}^2}{\sigma_{y\,\text{total}}^2}$$

Accordingly, the value $1 - r_{xy}^2$ characterizes the proportion of the variance of y caused by the influence of other factors not taken into account in the model.
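As an illustration of these two indicators, here is a small sketch that computes the linear correlation coefficient as cov(x, y) / (sigma_x * sigma_y) and the coefficient of determination as its square. The function names `correlation` and `determination` are hypothetical, chosen only for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linear correlation coefficient r_xy = cov(x, y) / (sigma_x * sigma_y).
// The 1/n factors of the covariance and of the variances cancel out.
double correlation(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mx += x[i];
        my += y[i];
    }
    mx /= n;
    my /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
    }
    assert(sxx > 0.0 && syy > 0.0);  // both variables must actually vary
    return sxy / std::sqrt(sxx * syy);
}

// Coefficient of determination of pairwise linear regression: r_xy squared.
double determination(const std::vector<double>& x, const std::vector<double>& y) {
    const double r = correlation(x, y);
    return r * r;
}
```

For x = (1, 2, 3) and y = (1, 3, 2) the correlation is 0.5, so the regression explains 25% of the variance of y and the remaining 75% is attributed to other factors.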

Questions for self-control

1. What is the essence of the method of least squares?

2. How many variables does pairwise regression involve?

3. Which coefficient measures the tightness of the relationship between the variables?

4. Within what limits does the coefficient of determination lie?

5. How is parameter b estimated in correlation and regression analysis?


Nonlinear economic models. Nonlinear regression models. Variable transformation. Elasticity coefficient.

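If there are non-linear relationships between economic phenomena, they are expressed using the corresponding non-linear functions: for example, an equilateral hyperbola $y = a + \frac{b}{x} + \varepsilon$, a second-degree parabola $y = a + b x + c x^2 + \varepsilon$, and so on.

There are two classes of non-linear regressions:

1. Regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, for example:

Polynomials of various degrees: $y = a + b x + c x^2 + \varepsilon$, $y = a + b x + c x^2 + d x^3 + \varepsilon$;

Equilateral hyperbola: $y = a + \frac{b}{x} + \varepsilon$;

Semilogarithmic function: $y = a + b \ln x + \varepsilon$.

2. Regressions that are non-linear in the estimated parameters, for example:

Power function: $y = a x^b \varepsilon$;

Exponential function with base b: $y = a b^x \varepsilon$;

Exponential function with base e: $y = e^{a + b x} \varepsilon$.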
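Some models of the second class can still be handled with linear least squares after a variable transformation. For the power function y = a x^b, taking logarithms gives ln y = ln a + b ln x, which is linear in ln a and b. The sketch below assumes strictly positive data; `PowerModel` and `fit_power` are hypothetical names. Note that it minimizes the squared errors of ln y, not of y itself, so the result is not identical to a direct non-linear fit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Parameters of the power model y = a * x^b.
struct PowerModel {
    double a;
    double b;
};

// Fit y = a * x^b by ordinary least squares on the transformed variables
// u = ln x and v = ln y, where the model becomes v = ln a + b * u.
PowerModel fit_power(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double su = 0.0, sv = 0.0, suv = 0.0, suu = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(x[i] > 0.0 && y[i] > 0.0);  // logarithms need positive data
        const double u = std::log(x[i]);
        const double v = std::log(y[i]);
        su += u;
        sv += v;
        suv += u * v;
        suu += u * u;
    }
    const double det = n * suu - su * su;
    assert(std::fabs(det) > 1e-12);  // all x equal: slope is undefined
    PowerModel m;
    m.b = (n * suv - su * sv) / det;          // slope in log-log coordinates
    m.a = std::exp((sv - m.b * su) / n);      // intercept is ln a
    return m;
}
```

For data that follow a power law exactly, for example y = 3 x^2 sampled at x = 1, 2, 4, 8, the function recovers a = 3 and b = 2 up to rounding.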

The total sum of the squared deviations of the individual values of the resulting feature y from the mean value $\bar{y}$ is caused by the influence of many factors. We conditionally divide the entire set of causes into two groups: the studied factor x and other factors.

If the factor does not affect the result, then the regression line on the graph is parallel to the Ox axis and $\hat{y}_x = \bar{y}$.

Then the entire variance of the resulting feature is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If other factors do not affect the result, then y is related to x functionally, and the residual sum of squares is zero. In this case, the sum of squared deviations explained by the regression is the same as the total sum of squares.

Since not all points of the correlation field lie on the regression line, their scatter is always due both to the influence of the factor x, i.e. the regression of y on x, and to the action of other causes (unexplained variation). The suitability of the regression line for forecasting depends on what part of the total variation of the feature y is accounted for by the explained variation.

Obviously, if the sum of squared deviations due to regression is greater than the residual sum of squares, then the regression equation is statistically significant and the factor x has a significant impact on the result y.

The significance of the regression equation as a whole is assessed using Fisher's F-test. The null hypothesis is that the regression coefficient is equal to zero, i.e. b = 0, and hence the factor x does not affect the result y.

The direct calculation of the F-criterion is preceded by an analysis of variance. Central to it is the decomposition of the total sum of squared deviations of the variable y from the mean value $\bar{y}$ into two parts, "explained" and "unexplained":

$$\sum (y - \bar{y})^2 = \sum (\hat{y}_x - \bar{y})^2 + \sum (y - \hat{y}_x)^2$$

where $\sum (y - \bar{y})^2$ is the total sum of squared deviations;

$\sum (\hat{y}_x - \bar{y})^2$ is the sum of squared deviations explained by the regression;

$\sum (y - \hat{y}_x)^2$ is the residual sum of squared deviations.

Any sum of squared deviations is related to a number of degrees of freedom, i.e. the number of independent variations of the feature. The number of degrees of freedom is related to the number of population units n and to the number of constants determined from it. In relation to the problem under study, the number of degrees of freedom should show how many independent deviations out of the n possible are required to form a given sum of squares.

Variance per degree of freedom D:

$$D_{total} = \frac{\sum (y - \bar{y})^2}{n - 1}, \qquad D_{fact} = \frac{\sum (\hat{y}_x - \bar{y})^2}{1}, \qquad D_{resid} = \frac{\sum (y - \hat{y}_x)^2}{n - 2}$$

F-ratio (F-criterion):

$$F = \frac{D_{fact}}{D_{resid}}$$

If the null hypothesis is true, then the factor and residual variances do not differ from each other. To reject H0, the factor variance must exceed the residual variance several times over. The American statistician Snedecor developed tables of critical values of the F-ratio for different significance levels of the null hypothesis and different numbers of degrees of freedom. The table value of the F-criterion is the maximum value of the variance ratio that can occur by chance at a given probability level if the null hypothesis holds. The computed value of the F-ratio is recognized as reliable if it is greater than the table value.

In this case, the null hypothesis that there is no relationship between the features is rejected and a conclusion is made about the significance of this relationship: if F_fact > F_table, H0 is rejected.

If the value is less than the table value, F_fact < F_table, then the probability of the null hypothesis is higher than the given level and it cannot be rejected without a serious risk of drawing the wrong conclusion about the presence of a relationship. In this case, the regression equation is considered statistically insignificant: H0 is not rejected.
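The computation of the actual F value for a pairwise linear regression can be sketched as follows. The code fits the line by least squares, splits the total sum of squares into its explained and residual parts, and divides the variances per degree of freedom (1 for the factor, n - 2 for the residual). `FTest` and `f_statistic` are hypothetical names; the table value has to be looked up separately and is not computed here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct FTest {
    double explained;  // sum (yhat - ybar)^2, 1 degree of freedom
    double residual;   // sum (y - yhat)^2, n - 2 degrees of freedom
    double F;          // ratio of the two variances per degree of freedom
};

FTest f_statistic(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() > 2);
    const double n = static_cast<double>(x.size());
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mx += x[i];
        my += y[i];
    }
    mx /= n;
    my /= n;
    double sxy = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
    }
    assert(sxx > 0.0);
    const double b = sxy / sxx;     // regression coefficient
    const double a = my - b * mx;   // intercept
    FTest t;
    t.explained = 0.0;
    t.residual = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double yhat = a + b * x[i];
        t.explained += (yhat - my) * (yhat - my);
        t.residual += (y[i] - yhat) * (y[i] - yhat);
    }
    assert(t.residual > 0.0);  // a perfect fit would give an infinite F
    t.F = t.explained / (t.residual / (n - 2.0));
    return t;
}
```

For x = (1, 2, 3, 4, 5) and y = (2, 4, 5, 4, 5) the explained sum is 3.6, the residual sum is 2.4 and F = 3.6 / (2.4 / 3) = 4.5, which would then be compared with the table value for 1 and 3 degrees of freedom.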

Standard error of the regression coefficient

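To assess the significance of the regression coefficient, its value is compared with its standard error, i.e. the actual value of Student's t-criterion is determined,

$$t_b = \frac{b}{m_b}, \qquad m_b = \sqrt{\frac{\sum (y - \hat{y}_x)^2 / (n - 2)}{\sum (x - \bar{x})^2}}$$

which is then compared with the table value at a certain significance level and number of degrees of freedom (n - 2).

Standard error of the parameter a:

$$m_a = \sqrt{\frac{\sum (y - \hat{y}_x)^2}{n - 2} \cdot \frac{\sum x^2}{n \sum (x - \bar{x})^2}}$$

The significance of the linear correlation coefficient is checked using the standard error of the correlation coefficient r:

$$m_r = \sqrt{\frac{1 - r^2}{n - 2}}$$

Total variance of the feature x:

$$\sigma_x^2 = \frac{\sum (x - \bar{x})^2}{n}$$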

Multiple Linear Regression

Model building

Multiple regression is a regression of a resulting feature on two or more factors, i.e. a model of the form

$$y = f(x_1, x_2, \ldots, x_p)$$

Pairwise regression can give a good result in modeling if the influence of other factors affecting the object of study can be neglected. The behavior of individual economic variables cannot be controlled, i.e. it is not possible to ensure that all other conditions are equal when assessing the influence of one factor under study. In this case, you should try to identify the influence of other factors by introducing them into the model, i.e. build a multiple regression equation: $y = a + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p + \varepsilon$.

The main goal of multiple regression is to build a model with a large number of factors, while determining the influence of each of them individually, as well as their cumulative impact on the modeled indicator. The specification of the model includes two groups of questions: the selection of factors and the choice of the type of regression equation.

The method of least squares (LSM) allows you to estimate various quantities using the results of many measurements containing random errors.

Characteristics of LSM

The main idea of this method is that the sum of squared errors is taken as the criterion for the accuracy of the solution, and one seeks to minimize it. Both numerical and analytical approaches can be applied when using this method.

In particular, as a numerical implementation, the least squares method implies making as many measurements of an unknown random variable as possible; the more measurements, the more accurate the solution. From this set of measurements (the initial data), a set of candidate solutions is obtained, from which the best one is then selected. If the set of solutions is parametrized, the least squares method reduces to finding the optimal values of the parameters.

In the analytical approach, a functional is defined on the set of initial data (measurements) and the set of candidate solutions; it can be expressed by a formula obtained as a hypothesis that needs to be confirmed. In this case, the least squares method reduces to finding the minimum of this functional over the squared errors of the initial data.

Note that it is not the errors themselves that are summed, but the squares of the errors. Why? The deviations of measurements from the exact value are often both positive and negative. When determining the average, simple summation can lead to an incorrect conclusion about the quality of the estimate, since positive and negative values cancel each other out and hide the true size of the errors, and with it the accuracy of the estimate.

To prevent this from happening, the squared deviations are summed. Moreover, in order to bring the final estimate to the same dimension as the measured quantity, the square root is extracted from the sum of squared errors.

Some applications of LSM

LSM is widely used in various fields. For example, in probability theory and mathematical statistics, the method is used to determine such a characteristic of a random variable as the standard deviation, which determines the width of the range of values of the random variable.

We approximate the function by a polynomial of the 2nd degree. To do this, we calculate the coefficients of the normal system of equations, compose the normal system of least squares, and solve it for the three coefficients of the polynomial.
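For reference, the general shape of such a normal system can be written out. Assuming the polynomial is written as P(x) = c_0 + c_1 x + c_2 x^2 and the data are the points (x_i, y_i), i = 1..n (this notation is an assumption, and the specific numbers of the example are not reproduced here), minimizing the sum of (P(x_i) - y_i)^2 gives:

```latex
\begin{pmatrix}
n & \sum x_i & \sum x_i^2 \\
\sum x_i & \sum x_i^2 & \sum x_i^3 \\
\sum x_i^2 & \sum x_i^3 & \sum x_i^4
\end{pmatrix}
\begin{pmatrix} c_0 \\ c_1 \\ c_2 \end{pmatrix}
=
\begin{pmatrix} \sum y_i \\ \sum x_i y_i \\ \sum x_i^2 y_i \end{pmatrix}
```

The three rows come from setting the partial derivatives with respect to c_0, c_1 and c_2 to zero, in the same way as the two normal equations of a straight line.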

Example 2. Finding the optimal degree of a polynomial.

Example 3. Derivation of a normal system of equations for finding the parameters of an empirical dependence.

Let us derive a system of equations for determining the coefficients of the function that performs the root-mean-square approximation of the given function at the given points. Compose the sum of squared deviations and write the necessary condition for its extremum:

Then the normal system will take the form:

We have obtained a linear system of equations for the unknown parameters, which is easily solved.


Example.

Experimental data on the values of the variables x and y are given in the table.

As a result of smoothing these data, a function was obtained.

Using the least squares method, approximate these data with a linear dependence y=ax+b (find the parameters a and b). Find out which of the two lines better fits the experimental data (in the sense of the least squares method). Make a drawing.

The essence of the method of least squares (LSM).

The problem is to find the coefficients of the linear dependence for which the function of two variables a and b

$$F(a, b) = \sum_{i=1}^{n} (y_i - (a x_i + b))^2$$

takes the smallest value. That is, for these a and b the sum of the squared deviations of the experimental data from the found straight line will be the smallest. This is the whole point of the least squares method.

Thus, the solution of the example is reduced to finding the extremum of a function of two variables.

Derivation of formulas for finding the coefficients.

A system of two equations with two unknowns is compiled and solved. We find the partial derivatives of the function F(a, b) with respect to the variables a and b and equate these derivatives to zero:

$$\begin{cases} \dfrac{\partial F}{\partial a} = -2 \sum_{i=1}^{n} (y_i - (a x_i + b)) x_i = 0 \\ \dfrac{\partial F}{\partial b} = -2 \sum_{i=1}^{n} (y_i - (a x_i + b)) = 0 \end{cases} \iff \begin{cases} a \sum x_i^2 + b \sum x_i = \sum x_i y_i \\ a \sum x_i + n b = \sum y_i \end{cases}$$

We solve the resulting system of equations by any method (for example, by substitution or by Cramer's rule) and obtain the formulas for the coefficients of the least squares method (LSM):

$$a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum y_i - a \sum x_i}{n}$$

For these a and b the function F(a, b) takes the smallest value. The proof of this fact is given below, at the end of the page.

That's the whole method of least squares. The formula for the parameter a contains the sums $\sum x_i$, $\sum y_i$, $\sum x_i y_i$, $\sum x_i^2$, and the parameter n is the number of experimental data points. It is recommended to calculate the values of these sums separately.

The coefficient b is found after a has been calculated.

It's time to remember the original example.

Solution.

In our example n=5. We fill in a table for the convenience of calculating the sums that appear in the formulas for the required coefficients.

The values in the fourth row of the table are obtained by multiplying the values of the 2nd row by the values of the 3rd row for each number i.

The values in the fifth row of the table are obtained by squaring the values of the 2nd row for each number i.

The values in the last column of the table are the sums of the values across the rows.

We use the formulas of the least squares method to find the coefficients a and b, substituting into them the corresponding values from the last column of the table.

Consequently, y=0.165x+2.184 is the desired approximating straight line.

It remains to find out which of the two lines, y=0.165x+2.184 or the function obtained earlier, better approximates the original data, i.e. to make an estimate using the least squares method.

Estimation of the error of the method of least squares.

To do this, calculate the sums of squared deviations of the original data from each of the two lines; the smaller value corresponds to the line that better approximates the original data in the sense of the least squares method.

Since the sum is smaller for the straight line, the line y=0.165x+2.184 approximates the original data better.
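The whole procedure (sums, coefficients, comparison of two candidates) can be sketched in code. The five sample points used in the usage note and the competing function g(x) = cbrt(x + 1) + 1 are assumptions chosen for illustration; the points happen to reproduce the line y = 0.165x + 2.184. The names `Coeffs`, `lsm_line`, `LineFn`, `sum_sq_dev` and `g` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Coeffs {
    double a;  // slope
    double b;  // intercept
};

// Closed-form least squares coefficients of y = a*x + b from the four sums.
Coeffs lsm_line(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i];
        sy += y[i];
        sxy += x[i] * y[i];
        sxx += x[i] * x[i];
    }
    const double den = n * sxx - sx * sx;
    assert(std::fabs(den) > 1e-12);  // zero only if all x_i coincide
    Coeffs c;
    c.a = (n * sxy - sx * sy) / den;
    c.b = (sy - c.a * sx) / n;  // b is found after a
    return c;
}

// Callable wrapper so a fitted line can be passed to sum_sq_dev.
struct LineFn {
    Coeffs c;
    double operator()(double x) const { return c.a * x + c.b; }
};

// Sum of squared deviations of the data from a candidate function f.
template <typename Func>
double sum_sq_dev(const std::vector<double>& x, const std::vector<double>& y, Func f) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = y[i] - f(x[i]);
        s += d * d;
    }
    return s;
}

// Assumed competing function, used only for the comparison example.
double g(double x) { return std::cbrt(x + 1.0) + 1.0; }
```

With the assumed points x = (0, 1, 2, 4, 5) and y = (2.1, 2.4, 2.6, 2.8, 3.0), `lsm_line` gives a of about 0.165 and b of about 2.184, and the sum of squared deviations of the line (about 0.019) is smaller than that of g (about 0.096), so the line fits better in the least squares sense.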

Graphic illustration of the least squares method (LSM).

Everything looks great on the charts. The red line is the found line y=0.165x+2.184, the blue line is the second function, and the pink dots are the original data.

What is it for, what are all these approximations for?

I personally use it for data smoothing, interpolation and extrapolation problems (in the original example, you could be asked to find the value of the observed quantity y at x=3 or at x=6 by the least squares method). But we will talk more about this later in another section of the site.


Proof.

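For the function F(a, b) to take its smallest value at the found a and b, it is sufficient that at this point the matrix of the quadratic form of the second-order differential of the function be positive definite. Let's show it.

The second-order differential has the form:

$$d^2 F(a; b) = \frac{\partial^2 F}{\partial a^2} da^2 + 2 \frac{\partial^2 F}{\partial a \, \partial b} \, da \, db + \frac{\partial^2 F}{\partial b^2} db^2$$

That is,

$$d^2 F(a; b) = 2 \sum x_i^2 \, da^2 + 4 \sum x_i \, da \, db + 2 n \, db^2$$

Therefore, the matrix of the quadratic form is

$$M = \begin{pmatrix} 2 \sum x_i^2 & 2 \sum x_i \\ 2 \sum x_i & 2 n \end{pmatrix}$$

and the values of its elements do not depend on a and b.

Let us show that the matrix is positive definite. This requires that the leading principal minors be positive.

The leading principal minor of the first order is $2 \sum x_i^2 > 0$. The inequality is strict, since the points do not coincide. This will be assumed in what follows.

The leading principal minor of the second order is

$$\det M = 4 n \sum x_i^2 - 4 \left(\sum x_i\right)^2 = 4 \left( n \sum x_i^2 - \left(\sum x_i\right)^2 \right)$$

The inequality $n \sum x_i^2 - \left(\sum x_i\right)^2 > 0$ can be proved by mathematical induction.

Conclusion: the found values a and b correspond to the smallest value of the function F(a, b) and are therefore the desired parameters for the least squares method.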


Development of a forecast using the least squares method. Problem solution example

Extrapolation is a method of scientific research based on extending past and present trends, patterns and relationships to the future development of the object of forecasting. Extrapolation methods include the moving average method, the exponential smoothing method and the least squares method.

The essence of the least squares method consists in minimizing the sum of squared deviations between the observed and calculated values. The calculated values are found from the selected equation, the regression equation. The smaller the distance between the actual and the calculated values, the more accurate the forecast based on the regression equation.

A theoretical analysis of the essence of the phenomenon under study, whose change is represented by a time series, serves as the basis for choosing the curve. Considerations about the nature of the growth of the levels of the series are sometimes taken into account. So, if output is expected to grow in an arithmetic progression, smoothing is performed with a straight line. If the growth turns out to be exponential, smoothing should be done with an exponential function.

The working formula of the least squares method: Y_{t+1} = a*X + b, where t + 1 is the forecast period; Y_{t+1} is the predicted indicator; a and b are coefficients; X is the time index.

Coefficients a and b are calculated according to the following formulas:

$$a = \frac{\sum (Y_f X) - \frac{\sum X \sum Y_f}{n}}{\sum X^2 - \frac{\left(\sum X\right)^2}{n}}, \qquad b = \frac{\sum Y_f}{n} - a \frac{\sum X}{n}$$

where Y_f are the actual values of the time series; n is the number of levels in the time series.

The smoothing of time series by the least squares method serves to reflect the patterns of development of the phenomenon under study. In the analytic expression of a trend, time is considered as an independent variable, and the levels of the series act as a function of this independent variable.

The development of a phenomenon does not depend on how many years have passed since the starting point, but on what factors influenced its development, in what direction and with what intensity. From this it is clear that the development of a phenomenon in time appears as a result of the action of these factors.

Correctly choosing the type of curve, i.e. the type of analytical dependence on time, is one of the most difficult tasks of pre-forecast analysis.

The type of function that describes the trend, whose parameters are determined by the least squares method, is in most cases selected empirically, by constructing a number of functions and comparing them with each other by the value of the root-mean-square error calculated by the formula:

$$\sigma = \sqrt{\frac{\sum (Y_f - Y_r)^2}{n - p}}$$

where Y_f are the actual values of the time series; Y_r are the calculated (smoothed) values of the time series; n is the number of levels in the time series; p is the number of parameters defined in the formulas describing the trend (development trend).

Disadvantages of the least squares method:

  • when the economic phenomenon under study is described by a mathematical equation, the forecast is accurate only for a short period of time, and the regression equation should be recalculated as new information becomes available;
  • selecting the regression equation is difficult, although this can be handled with standard computer programs.

An example of using the least squares method to develop a forecast

Task. Data are available on the unemployment rate in the region, in %.

  • Build a forecast of the unemployment rate in the region for the months of November, December, January, using the methods: moving average, exponential smoothing, least squares.
  • Calculate the errors in the resulting forecasts using each method.
  • Compare the results obtained, draw conclusions.

Least squares solution

For the solution, we will compile a table in which we will make the necessary calculations:

ε = 28.63/10 = 2.86%, so the forecast accuracy is high.

Conclusion: Comparing the results obtained by the moving average method, exponential smoothing and the least squares method, we can say that the average relative error for the exponential smoothing method falls within 20-50%. This means that the forecast accuracy in this case is only satisfactory.

In the first and third cases the forecast accuracy is high, since the average relative error is less than 10%. The moving average method gave the most reliable results (forecast for November: 1.52%, for December: 1.53%, for January: 1.49%), since its average relative error is the smallest, 1.13%.
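The average relative error used in this comparison can be sketched as below. The formula is an assumption consistent with the calculation ε = 28.63/10 above: the mean of |actual - calculated| / actual, expressed in percent. Only the two accuracy bands named in the text are encoded; the function names and sample values are hypothetical.

```python
def mean_relative_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    terms = [abs(a - p) / abs(a) * 100 for a, p in zip(actual, predicted)]
    return sum(terms) / len(terms)


def accuracy_label(eps):
    """Map the error to the accuracy bands mentioned in the text."""
    if eps < 10:
        return "high"
    if 20 <= eps <= 50:
        return "satisfactory"
    return "not classified in the text"


# Hypothetical actual and smoothed values.
eps = mean_relative_error([100.0, 200.0], [110.0, 190.0])
print(eps, accuracy_label(eps))
```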

LSM online program

Data and approximation y = a + b x

`i` - number of the experimental point;
`x_i` - value of the fixed parameter at point `i`;
`y_i` - value of the measured parameter at point `i`;
`ω_i` - weight of the measurement at point `i`;
`y_(i, calc)` - value of `y` calculated from the regression at point `i`;
`Δy_i` - difference between the measured value and the value calculated from the regression at point `i`;
`S_(x_i)(x_i)` - error estimate of `x_i` when measuring `y` at point `i`.

The same columns are used for the approximation y = k x.

User manual for the LSM online program.

In the data field, enter the values of `x` and `y` for one experimental point on each separate line. Values must be separated by whitespace (a space or a tab).

The third value on a line can be the weight `w` of the point. If the weight is not specified, it is taken to be one. In the overwhelming majority of cases the weights of the experimental points are unknown or not calculated, and all experimental data are considered equivalent. Sometimes the weights in the studied range of values are definitely not equivalent and can even be calculated theoretically. For example, in spectrophotometry the weights can be calculated using simple formulas, although this is usually neglected to save effort.

Data can be pasted through the clipboard from an office suite spreadsheet, such as Excel from Microsoft Office or Calc from OpenOffice. To do this, select the range of data in the spreadsheet, copy it to the clipboard, and paste it into the data field on this page.

The least squares calculation requires at least two points to determine the two coefficients: `b`, the tangent of the angle of inclination of the straight line, and `a`, the value at which the line intercepts the `y` axis.

To estimate the error of the calculated regression coefficients, it is necessary to set the number of experimental points to more than two.
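The input format described above can be read with a short parser. This is only a sketch of one possible reading of the rules (two or three whitespace-separated numbers per line, weight defaulting to one); the function name `parse_points` is hypothetical and is not the program's actual code.

```python
def parse_points(text):
    """Parse lines of 'x y' or 'x y w' into (x, y, w) tuples; w defaults to 1."""
    points = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()          # splits on spaces and tabs
        if not fields:
            continue                   # skip empty lines
        if len(fields) not in (2, 3):
            raise ValueError(f"line {line_no}: expected 'x y' or 'x y w'")
        x, y = float(fields[0]), float(fields[1])
        w = float(fields[2]) if len(fields) == 3 else 1.0
        points.append((x, y, w))
    return points


def can_estimate_errors(points):
    """Two points define the line; more than two are needed for error estimates."""
    return len(points) > 2


pasted = "1 2\n3\t4 0.5\n\n5 6\n"
points = parse_points(pasted)
print(points, can_estimate_errors(points))
```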

Least squares method (LSM).

The greater the number of experimental points, the more accurate the statistical estimate of the coefficients (because Student's coefficient decreases) and the closer the estimate is to that of the general population.

Obtaining a value at each experimental point often involves significant labor, so a compromise number of experiments is usually carried out, one that gives an acceptable estimate without excessive labor costs. As a rule, for a linear least squares dependence with two coefficients, about 5-7 experimental points are chosen.

A Brief Theory of Least Squares for Linear Dependence

Suppose we have a set of experimental data in the form of pairs of values ​​[`y_i`, `x_i`], where `i` is the number of one experimental measurement from 1 to `n`; `y_i` - the value of the measured value at the point `i`; `x_i` - the value of the parameter we set at the point `i`.

An example is the operation of Ohm's law. By changing the voltage (potential difference) between sections of the electrical circuit, we measure the amount of current passing through this section. Physics gives us the dependence found experimentally:

`I=U/R`,
where `I` - current strength; `R` - resistance; `U` - voltage.

In this case, `y_i` is the measured current value, and `x_i` is the voltage value.

As another example, consider the absorption of light by a solution of a substance. Chemistry gives us the formula:

`A = εl C`,
where `A` is the optical density of the solution; `ε` is the molar absorption coefficient of the solute; `l` is the path length of light through the cuvette with the solution; `C` is the concentration of the solute.

In this case, `y_i` is the measured optical density `A`, and `x_i` is the concentration of the substance that we set.

We will consider the case when the relative error in setting `x_i` is much less than the relative error in measuring `y_i`. We will also assume that all measured values ​​of `y_i` are random and normally distributed, i.e. obey the normal distribution law.

In the case of a linear dependence of `y` on `x`, we can write the theoretical dependence:
`y = a + bx`.

From a geometric point of view, the coefficient `b` denotes the tangent of the angle of inclination of the line to the `x` axis, and the coefficient `a` - the value of `y` at the point of intersection of the line with the `y` axis (for `x = 0`).

Finding the parameters of the regression line.

In an experiment, the measured values ​​of `y_i` cannot lie exactly on the theoretical line due to measurement errors, which are always inherent in real life. Therefore, a linear equation must be represented by a system of equations:
`y_i = a + b x_i + ε_i` (1),
where `ε_i` is the unknown measurement error of `y` in the `i`th experiment.

Dependence (1) is also called regression, i.e. the dependence of the two quantities on each other with statistical significance.

The task of restoring the dependence is to find the coefficients `a` and `b` from the experimental points [`y_i`, `x_i`].

The coefficients `a` and `b` are usually found by the least squares method (LSM). It is a special case of the maximum likelihood principle.

Let's rewrite (1) as `ε_i = y_i - a - b x_i`.

Then the sum of squared errors will be
`Φ = sum_(i=1)^(n) ε_i^2 = sum_(i=1)^(n) (y_i - a - b x_i)^2`. (2)

The principle of the least squares method is to minimize the sum (2) with respect to the parameters `a` and `b`.

The minimum is reached when the partial derivatives of the sum (2) with respect to the coefficients `a` and `b` are equal to zero:
`frac(partial Φ)(partial a) = frac(partial sum_(i=1)^(n) (y_i - a - b x_i)^2)(partial a) = 0`
`frac(partial Φ)(partial b) = frac(partial sum_(i=1)^(n) (y_i - a - b x_i)^2)(partial b) = 0`

Expanding the derivatives, we obtain a system of two equations with two unknowns:
`sum_(i=1)^(n) (2a + 2bx_i - 2y_i) = sum_(i=1)^(n) (a + bx_i - y_i) = 0`
`sum_(i=1)^(n) (2bx_i^2 + 2ax_i - 2x_iy_i) = sum_(i=1)^(n) (bx_i^2 + ax_i - x_iy_i) = 0`

We open the brackets and move the sums that do not depend on the desired coefficients to the other side, obtaining a system of linear equations:
`sum_(i=1)^(n) y_i = a n + b sum_(i=1)^(n) x_i`
`sum_(i=1)^(n) x_iy_i = a sum_(i=1)^(n) x_i + b sum_(i=1)^(n) x_i^2`

Solving the resulting system, we find formulas for the coefficients `a` and `b`:

`a = frac(sum_(i=1)^(n) y_i sum_(i=1)^(n) x_i^2 - sum_(i=1)^(n) x_i sum_(i=1)^(n) x_iy_i) (n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2)` (3.1)

`b = frac(n sum_(i=1)^(n) x_iy_i - sum_(i=1)^(n) x_i sum_(i=1)^(n) y_i) (n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2)` (3.2)

These formulas have a solution when `n > 1` (a line needs at least 2 points) and when the determinant `D = n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2 != 0`, i.e. when not all `x_i` in the experiment are equal (otherwise the points lie on a vertical line).
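Formulas (3.1) and (3.2), together with the two solvability conditions, translate directly into code. The following is a minimal sketch; the function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) of y = a + b*x using formulas (3.1) and (3.2)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("at least two points are required")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    d = n * sum_x2 - sum_x ** 2          # determinant D
    if d == 0:
        raise ValueError("all x values are equal; the line is undefined")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / d   # (3.1)
    b = (n * sum_xy - sum_x * sum_y) / d        # (3.2)
    return a, b


# Hypothetical points lying exactly on y = 1 + 2x.
print(fit_line([0, 1, 2], [1, 3, 5]))
```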

Estimation of errors in the coefficients of the regression line

For a more accurate estimate of the error in calculating the coefficients `a` and `b`, a large number of experimental points is desirable. When `n = 2`, it is impossible to estimate the error of the coefficients, because the approximating line will uniquely pass through two points.

The error of a random variable `V` is determined by the law of error accumulation
`S_V^2 = sum_(i=1)^p (frac(partial f)(partial z_i))^2 S_(z_i)^2`,
where `p` is the number of `z_i` parameters with `S_(z_i)` error that affect the `S_V` error;
`f` is a dependency function of `V` on `z_i`.

Let's write the law of accumulation of errors for the error of the coefficients `a` and `b`
`S_a^2 = sum_(i=1)^(n)(frac(partial a)(partial y_i))^2 S_(y_i)^2 + sum_(i=1)^(n)(frac(partial a )(partial x_i))^2 S_(x_i)^2 = S_y^2 sum_(i=1)^(n)(frac(partial a)(partial y_i))^2 `,
`S_b^2 = sum_(i=1)^(n)(frac(partial b)(partial y_i))^2 S_(y_i)^2 + sum_(i=1)^(n)(frac(partial b )(partial x_i))^2 S_(x_i)^2 = S_y^2 sum_(i=1)^(n)(frac(partial b)(partial y_i))^2 `,
because `S_(x_i)^2 = 0` (we previously made a reservation that the error of `x` is negligible).

`S_y^2 = S_(y_i)^2` - the error (variance, squared standard deviation) in the `y` dimension, assuming that the error is uniform for all `y` values.

Substituting formulas for calculating `a` and `b` into the resulting expressions, we get

`S_a^2 = S_y^2 frac(sum_(i=1)^(n) (sum_(i=1)^(n) x_i^2 - x_i sum_(i=1)^(n) x_i)^2 ) (D^2) = S_y^2 frac((n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2) sum_(i=1) ^(n) x_i^2) (D^2) = S_y^2 frac(sum_(i=1)^(n) x_i^2) (D)` (4.1)

`S_b^2 = S_y^2 frac(sum_(i=1)^(n) (n x_i - sum_(i=1)^(n) x_i)^2) (D^2) = S_y^2 frac( n (n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2)) (D^2) = S_y^2 frac(n) (D) ` (4.2)

In most real experiments the value of `S_y` is not measured. Measuring it requires several parallel measurements (experiments) at one or several points of the plan, which increases the time (and possibly the cost) of the experiment. Therefore it is usually assumed that the deviation of `y` from the regression line can be considered random. The estimate of the variance of `y` is then calculated by the formula

`S_y^2 = S_(y, rest)^2 = frac(sum_(i=1)^n (y_i - a - b x_i)^2) (n-2)`.

The divisor `n-2` appears because we have reduced the number of degrees of freedom due to the calculation of two coefficients for the same sample of experimental data.

This estimate is also called the residual variance relative to the regression line `S_(y, rest)^2`.
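The error estimates (4.1) and (4.2), with the residual variance standing in for `S_y^2`, can be sketched as follows. The function name `fit_line_with_errors` and the data are hypothetical; the code assumes more than two points so that `n - 2` is positive.

```python
import math


def fit_line_with_errors(xs, ys):
    """Fit y = a + b*x and return (a, b, S_a, S_b, residual variance)."""
    n = len(xs)
    if n < 3:
        raise ValueError("more than two points are needed to estimate errors")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    d = n * sum_x2 - sum_x ** 2
    if d == 0:
        raise ValueError("all x values are equal")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / d
    b = (n * sum_xy - sum_x * sum_y) / d
    # Residual variance with n - 2 degrees of freedom.
    s_y2 = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    s_a = math.sqrt(s_y2 * sum_x2 / d)   # (4.1)
    s_b = math.sqrt(s_y2 * n / d)        # (4.2)
    return a, b, s_a, s_b, s_y2


# Hypothetical measurements scattered around a straight line.
a, b, s_a, s_b, s_y2 = fit_line_with_errors([0, 1, 2, 3], [1.1, 2.9, 5.1, 6.9])
print(a, b, s_a, s_b, s_y2)
```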

The assessment of the significance of the coefficients is carried out according to the Student's criterion

`t_a = frac(|a|) (S_a)`, `t_b = frac(|b|) (S_b)`

If the calculated criteria `t_a`, `t_b` are less than the table criteria `t(P, n-2)`, then it is considered that the corresponding coefficient is not significantly different from zero with a given probability `P`.

To assess the quality of the description of a linear relationship, you can compare `S_(y, rest)^2` with the variance `S_(bar y)^2` relative to the mean using the Fisher criterion.

`S_(bar y)^2 = frac(sum_(i=1)^n (y_i - bar y)^2) (n-1) = frac(sum_(i=1)^n (y_i - (sum_(i=1)^n y_i) /n)^2) (n-1)` - sample estimate of the variance of `y` relative to the mean.

To evaluate how effectively the regression equation describes the dependence, the Fisher coefficient is calculated
`F = S_(bar y)^2 / S_(y, rest)^2`,
which is compared with the tabular Fisher coefficient `F(P, n-1, n-2)`.

If `F > F(P, n-1, n-2)`, the difference between the description of the dependence `y = f(x)` using the regression equation and the description using the mean is considered statistically significant with probability `P`. That is, the regression describes the dependence better than the spread of `y` around the mean.
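The Student and Fisher statistics can be computed together once the line is fitted. The sketch below only produces the calculated values; the tabular values `t(P, n-2)` and `F(P, n-1, n-2)` still have to be looked up separately. The function name `regression_significance` and the data are hypothetical.

```python
import math


def regression_significance(xs, ys):
    """Return (t_a, t_b, F) for the line y = a + b*x fitted by least squares."""
    n = len(xs)
    if n < 3:
        raise ValueError("more than two points are needed")
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    d = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / d
    b = (n * sum_xy - sum_x * sum_y) / d
    # Residual variance around the regression line.
    s_rest2 = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    if s_rest2 == 0:
        raise ValueError("perfect fit: the ratios are undefined")
    # Variance of y around its mean.
    mean_y = sum_y / n
    s_mean2 = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    t_a = abs(a) / math.sqrt(s_rest2 * sum_x2 / d)
    t_b = abs(b) / math.sqrt(s_rest2 * n / d)
    f_ratio = s_mean2 / s_rest2
    return t_a, t_b, f_ratio


# Hypothetical measurements.
t_a, t_b, f_ratio = regression_significance([0, 1, 2, 3], [1.1, 2.9, 5.1, 6.9])
print(t_a, t_b, f_ratio)
```

Each calculated value is then compared with the tabular one for the chosen probability `P`.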

Least squares method

The method of least squares means determining the unknown parameters a, b, c, … of an accepted functional dependence

y = f(x,a,b,c,…),

so as to provide a minimum of the mean square (variance) of the error

`σ^2 = frac(1)(n) sum_(i=1)^(n) (y_i - f(x_i,a,b,c,…))^2`, (24)

where x_i, y_i is the set of pairs of numbers obtained from the experiment.

Since the condition for an extremum of a function of several variables is that its partial derivatives are equal to zero, the parameters a, b, c, … are determined from the system of equations:

`frac(partial σ^2)(partial a) = 0`; `frac(partial σ^2)(partial b) = 0`; `frac(partial σ^2)(partial c) = 0`; … (25)

It must be remembered that the least squares method is used to select the parameters after the form of the function y = f(x) has been chosen.

If from theoretical considerations it is impossible to draw any conclusions about what the empirical formula should be, then one has to be guided by visual representations, primarily a graphical representation of the observed data.

In practice, one is most often limited to the following types of functions:

1) linear: y = ax + b;

2) quadratic: y = ax^2 + bx + c.
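For the quadratic case, setting the three partial derivatives to zero gives a system of three linear equations in a, b and c. The sketch below assumes the quadratic has the form y = a x^2 + b x + c and solves that system by Gaussian elimination; the helper names and the sample data are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("system is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_quadratic(xs, ys):
    """Return (a, b, c) of y = a*x^2 + b*x + c from the normal equations."""
    def sx(p):
        return sum(x ** p for x in xs)

    def sxy(p):
        return sum((x ** p) * y for x, y in zip(xs, ys))

    matrix = [
        [sx(4), sx(3), sx(2)],
        [sx(3), sx(2), sx(1)],
        [sx(2), sx(1), len(xs)],
    ]
    rhs = [sxy(2), sxy(1), sxy(0)]
    return tuple(solve_linear_system(matrix, rhs))


# Hypothetical points lying exactly on y = 2x^2 - 3x + 1.
print(fit_quadratic([-2, -1, 0, 1, 2], [15, 6, 1, 0, 3]))
```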

Example.

Experimental data on the values of the variables x and y are given in the table.

As a result of their alignment, the function

Using the least squares method, approximate these data with a linear dependence y=ax+b (find the parameters a and b). Find out which of the two lines aligns the experimental data better (in the sense of the least squares method). Make a drawing.

The essence of the method of least squares (LSM).

The problem is to find the coefficients of the linear dependence for which the function of two variables a and b, `F(a, b) = sum_(i=1)^(n) (y_i - (a x_i + b))^2`, takes the smallest value. That is, for the found a and b, the sum of squared deviations of the experimental data from the resulting straight line is the smallest. This is the whole point of the least squares method.

Thus, the solution of the example is reduced to finding the extremum of a function of two variables.

Derivation of formulas for finding coefficients.

A system of two equations with two unknowns is compiled and solved. We find the partial derivatives of the function with respect to the variables a and b and equate them to zero.

We solve the resulting system of equations by any method (for example, substitution or Cramer's rule) and obtain the formulas for the coefficients of the least squares method (LSM):

`a = frac(n sum_(i=1)^(n) x_iy_i - sum_(i=1)^(n) x_i sum_(i=1)^(n) y_i) (n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2)`

`b = frac(sum_(i=1)^(n) y_i - a sum_(i=1)^(n) x_i) (n)`

For these a and b the function takes its smallest value. The proof of this fact is given below the text at the end of the page.

That's the whole method of least squares. The formula for the parameter a contains the sums `sum x_i`, `sum y_i`, `sum x_iy_i`, `sum x_i^2` and the parameter n, the number of experimental data points. It is recommended to calculate these sums separately. The coefficient b is found after a has been calculated.

It's time to remember the original example.

Solution.

In our example n=5. We fill in the table for the convenience of calculating the amounts that are included in the formulas of the required coefficients.

The values ​​in the fourth row of the table are obtained by multiplying the values ​​of the 2nd row by the values ​​of the 3rd row for each number i.

The values ​​in the fifth row of the table are obtained by squaring the values ​​of the 2nd row for each number i.

The values ​​of the last column of the table are the sums of the values ​​across the rows.

We use the formulas of the least squares method to find the coefficients a and b. We substitute in them the corresponding values ​​from the last column of the table:

Consequently, y=0.165x+2.184 is the desired approximating straight line.
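The tabular calculation can be mirrored in code: compute the four sums first, then a, then b. The data table is not reproduced in this section, so the five points below are illustrative values chosen so that the fit gives coefficients close to the line quoted above; they are an assumption, not the original table.

```python
# Illustrative data (assumed, n = 5); not taken from the original table.
xs = [0, 1, 2, 4, 5]
ys = [2.1, 2.4, 2.6, 2.8, 3.0]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))   # row of products x_i * y_i
sum_x2 = sum(x * x for x in xs)               # row of squares x_i^2

# Slope a first, then intercept b, for the line y = a*x + b.
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - a * sum_x) / n

print(round(a, 3), round(b, 3))
```

Keeping the sums as separate variables matches the recommendation to calculate them separately before substituting into the formulas.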

It remains to find out which of the two lines, y=0.165x+2.184 or the second function from the problem statement, approximates the original data better, i.e. to make an estimate using the least squares method.

Estimation of the error of the method of least squares.

To do this, calculate the sums of squared deviations of the original data from each of the two lines; the smaller value corresponds to the line that better approximates the original data in the sense of the least squares method.

Since the sum is smaller for the line y=0.165x+2.184, this line approximates the original data better.
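The comparison of two candidate functions by their sums of squared deviations can be sketched as below. Both the data points and the second function are assumptions made for illustration, since neither the table nor the second function is reproduced in this section; only the line y=0.165x+2.184 comes from the text.

```python
def sum_squared_deviations(xs, ys, func):
    """Sum of squared deviations of the data from func(x)."""
    return sum((y - func(x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data (assumed, not the original table).
xs = [0, 1, 2, 4, 5]
ys = [2.1, 2.4, 2.6, 2.8, 3.0]


def fitted_line(x):
    return 0.165 * x + 2.184          # line quoted in the text


def alternative(x):
    # Hypothetical stand-in for the second function of the problem.
    return (x + 1) ** (1 / 3) + 1


sse_line = sum_squared_deviations(xs, ys, fitted_line)
sse_alt = sum_squared_deviations(xs, ys, alternative)
better = "fitted line" if sse_line < sse_alt else "alternative"
print(sse_line, sse_alt, better)
```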

Graphic illustration of the least squares method (LSM).

Everything is clearly visible on the chart. The red line is the found line y=0.165x+2.184, the blue line is the second function, and the pink dots are the original data.

In practice, when modeling various processes - in particular, economic, physical, technical, social - one or another method of calculating the approximate values ​​of functions from their known values ​​at some fixed points is widely used.

Problems of approximation of functions of this kind often arise:

    when constructing approximate formulas for calculating the values ​​of the characteristic quantities of the process under study according to the tabular data obtained as a result of the experiment;

    in numerical integration, differentiation, solving differential equations, etc.;

    if it is necessary to calculate the values ​​of functions at intermediate points of the considered interval;

    when determining the values ​​of the characteristic quantities of the process outside the interval under consideration, in particular, when forecasting.

If, in order to model a certain process specified by a table, a function is constructed that approximately describes this process based on the least squares method, it will be called an approximating function (regression), and the task of constructing approximating functions itself will be an approximation problem.

This article discusses the possibilities of the MS Excel package for solving such problems, in addition, methods and techniques for constructing (creating) regressions for tabularly given functions (which is the basis of regression analysis) are given.

There are two options for building regressions in Excel.

    Adding selected regressions (trendlines) to a chart built on the basis of a data table for the studied process characteristic (available only if a chart is built);

    Using the built-in statistical functions of the Excel worksheet, which allows you to get regressions (trend lines) directly from the source data table.

Adding Trendlines to a Chart

For a table of data describing a certain process and represented by a diagram, Excel has an effective regression analysis tool that allows you to:

    build on the basis of the least squares method and add to the diagram five types of regressions that model the process under study with varying degrees of accuracy;

    add an equation of the constructed regression to the diagram;

    determine the degree of compliance of the selected regression with the data displayed on the chart.

Based on the chart data, Excel allows you to get linear, polynomial, logarithmic, power and exponential types of regressions, which are given by the equation:

y = y(x)

where x is an independent variable, which often takes the values of a sequence of natural numbers (1; 2; 3; ...) and, for example, counts the time of the process under study.

1 . Linear regression is good at modeling features that increase or decrease at a constant rate. This is the simplest model of the process under study. It is built according to the equation:

y=mx+b

where m is the tangent of the slope of the linear regression to the x-axis; b - coordinate of the point of intersection of the linear regression with the y-axis.

2 . A polynomial trendline is useful for describing characteristics that have several distinct extremes (highs and lows). The choice of the degree of the polynomial is determined by the number of extrema of the characteristic under study. Thus, a polynomial of the second degree can well describe a process that has only one maximum or minimum; polynomial of the third degree - no more than two extrema; polynomial of the fourth degree - no more than three extrema, etc.

In this case, the trend line is built in accordance with the equation:

y = c0 + c1x + c2x^2 + c3x^3 + c4x^4 + c5x^5 + c6x^6

where the coefficients c0, c1, c2,... c6 are constants whose values ​​are determined during construction.

3 . The logarithmic trend line is successfully used in modeling characteristics, the values ​​of which change rapidly at first, and then gradually stabilize.

y = c ln(x) + b

4 . The power trend line gives good results if the values of the studied dependence are characterized by a constant change in the growth rate. An example of such a dependence is the graph of uniformly accelerated motion of a car. If there are zero or negative values in the data, you cannot use a power trend line.

It is built in accordance with the equation:

y = cx^b

where the coefficients b, c are constants.

5 . An exponential trendline should be used if the rate of change in the data is continuously increasing. For data containing zero or negative values, this kind of approximation is also not applicable.

It is built in accordance with the equation:

y = ce^(bx)

where the coefficients b, c are constants.
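One common way to fit power and exponential trends with least squares is to take logarithms, which turns both equations into straight lines. This also shows why zero or negative values are not allowed. The sketch below is an assumption about the technique, not a description of Excel's internals; note that it minimizes squared errors of the logarithms rather than of the original values. The function names are hypothetical.

```python
import math


def _fit_line(us, vs):
    """Least-squares slope and intercept of v = slope*u + intercept."""
    n = len(us)
    su, sv = sum(us), sum(vs)
    suu = sum(u * u for u in us)
    suv = sum(u * v for u, v in zip(us, vs))
    slope = (n * suv - su * sv) / (n * suu - su ** 2)
    intercept = (sv - slope * su) / n
    return slope, intercept


def fit_exponential(xs, ys):
    """Fit y = c*exp(b*x) via ln(y) = ln(c) + b*x. Returns (c, b)."""
    if any(y <= 0 for y in ys):
        raise ValueError("y values must be positive")
    b, ln_c = _fit_line(xs, [math.log(y) for y in ys])
    return math.exp(ln_c), b


def fit_power(xs, ys):
    """Fit y = c*x**b via ln(y) = ln(c) + b*ln(x). Returns (c, b)."""
    if any(x <= 0 for x in xs) or any(y <= 0 for y in ys):
        raise ValueError("x and y values must be positive")
    b, ln_c = _fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(ln_c), b


# Hypothetical data generated from known curves.
xs = [1, 2, 3, 4]
print(fit_exponential(xs, [2 * math.exp(0.5 * x) for x in xs]))
print(fit_power(xs, [3 * x ** 1.5 for x in xs]))
```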

When a trend line is fitted, Excel automatically calculates the value of R^2, which characterizes the accuracy of the approximation: the closer R^2 is to one, the more reliably the trend line approximates the process under study. If necessary, the value of R^2 can be displayed on the chart.

It is determined by the formula:

`R^2 = 1 - frac(SSE)(SST)`, where `SSE = sum (y_i - hat y_i)^2`, `SST = sum y_i^2 - frac((sum y_i)^2)(n)`, and `hat y_i` are the values given by the trend line.
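The value of R^2 is easy to compute outside the spreadsheet as well. The sketch below assumes the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares of the data around their mean; the function name `r_squared` and the numbers are hypothetical.

```python
def r_squared(ys, fitted):
    """R^2 = 1 - SSE/SST for observed ys and trend values fitted."""
    n = len(ys)
    if n != len(fitted) or n == 0:
        raise ValueError("series must be non-empty and of equal length")
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum(y * y for y in ys) - sum(ys) ** 2 / n
    if sst == 0:
        raise ValueError("all y values are equal; R^2 is undefined")
    return 1 - sse / sst


# A perfect fit gives 1; predicting the mean everywhere gives 0.
print(r_squared([1, 2, 3], [1, 2, 3]))
print(r_squared([1, 2, 3], [2, 2, 2]))
```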

To add a trend line to a data series:

    activate the chart built on the basis of the data series, i.e., click within the chart area. The Chart item will appear in the main menu;

    after clicking on this item, a menu will appear on the screen, in which you should select the Add trend line command.

The same actions are easily implemented if you hover over the graph corresponding to one of the data series and right-click; in the context menu that appears, select the Add trend line command. The Trendline dialog box will appear on the screen with the Type tab opened (Fig. 1).

After that you need:

1 . On the Type tab, select the required trend line type (Linear is selected by default). For the Polynomial type, in the Degree field, specify the degree of the selected polynomial.

2 . The Built on Series field lists all the data series in the chart in question. To add a trendline to a specific data series, select its name in the Built on Series field.

If necessary, by going to the Parameters tab (Fig. 2), you can set the following parameters for the trend line:

    change the name of the trend line in the Name of the approximating (smoothed) curve field.

    set the number of periods (forward or backward) for the forecast in the Forecast field;

    display the equation of the trend line in the chart area, for which you should enable the checkbox show the equation on the chart;

    display the value of the approximation reliability R2 in the diagram area, for which you should enable the checkbox put the value of the approximation reliability (R^2) on the diagram;

    set the point of intersection of the trend line with the Y-axis, for which you should enable the checkbox Intersection of the curve with the Y-axis at a point;

    click the OK button to close the dialog box.

There are three ways to start editing an already built trendline:

    use the Selected trend line command from the Format menu, after selecting the trend line;

    select the Format Trendline command from the context menu, which is called by right-clicking on the trendline;

    by double clicking on the trend line.

The Format Trendline dialog box will appear on the screen (Fig. 3), containing three tabs: View, Type, Parameters, and the contents of the last two completely coincide with the similar tabs of the Trendline dialog box (Fig. 1-2). On the View tab, you can set the line type, its color and thickness.

To delete an already constructed trend line, select the trend line to be deleted and press the Delete key.

The advantages of the considered regression analysis tool are:

    the relative ease of plotting a trend line on charts without creating a data table for it;

    a fairly wide list of types of proposed trend lines, and this list includes the most commonly used types of regression;

    the possibility of predicting the behavior of the process under study for an arbitrary (within common sense) number of steps forward, as well as back;

    the possibility of obtaining the equation of the trend line in an analytical form;

    the possibility, if necessary, of obtaining an assessment of the reliability of the approximation.

The disadvantages include the following points:

    the construction of a trend line is carried out only if there is a chart built on a series of data;

    generating data series for the characteristic under study from the trend line equations obtained for it is somewhat cumbersome: the regression equations are updated with every change in the values of the original data series, but only within the chart area, while a data series built from the old trend line equation remains unchanged;

    In PivotChart reports, when you change the chart view or the associated PivotTable report, existing trendlines are not retained, so you must ensure that the layout of the report meets your requirements before you draw trendlines or otherwise format the PivotChart report.

Trend lines can be added to data series presented on charts such as line charts, histograms, flat non-stacked area charts, bar, scatter, bubble and stock charts.

You cannot add trendlines to data series on 3-D, stacked, radar, pie, and doughnut charts.

Using Built-in Excel Functions

Excel also provides a regression analysis tool for plotting trendlines outside the chart area. A number of statistical worksheet functions can be used for this purpose, but all of them allow you to build only linear or exponential regressions.

Excel has several functions for building linear regression, in particular:

    TREND;

    LINEST;

    SLOPE and INTERCEPT.

As well as several functions for constructing an exponential trend line, in particular:

    GROWTH;

    LOGEST.

It should be noted that the techniques for constructing regressions using the TREND and GROWTH functions are practically the same. The same can be said about the pair of functions LINEST and LOGEST. For these four functions, building a table of values relies on Excel array formulas, which makes the process of building regressions somewhat cumbersome. We also note that a linear regression is, in our opinion, easiest to build with the SLOPE and INTERCEPT functions, where the first determines the slope of the linear regression and the second determines the segment cut off by the regression on the y-axis.
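For readers who want to check worksheet results outside Excel, the quantities returned by SLOPE and INTERCEPT can be reproduced in a few lines. The Python helpers below are hypothetical stand-ins that only mirror the argument order (known y values first, then known x values); they are a sketch, not Excel's implementation.

```python
def slope(known_ys, known_xs):
    """Slope of the least-squares line through the points."""
    n = len(known_xs)
    if n != len(known_ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    if sxx == 0:
        raise ValueError("all x values are equal")
    return sxy / sxx


def intercept(known_ys, known_xs):
    """Value at which the least-squares line crosses the y-axis."""
    n = len(known_xs)
    m = slope(known_ys, known_xs)
    return sum(known_ys) / n - m * sum(known_xs) / n


# Hypothetical points lying exactly on y = 2x + 1.
print(slope([1, 3, 5], [0, 1, 2]), intercept([1, 3, 5], [0, 1, 2]))
```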

The advantages of the built-in functions tool for regression analysis are:

    a fairly simple process of the same type of formation of data series of the characteristic under study for all built-in statistical functions that set trend lines;

    a standard technique for constructing trend lines based on the generated data series;

    the ability to predict the behavior of the process under study for the required number of steps forward or backward.

The disadvantages include the fact that Excel has no built-in functions for other types of trend lines (besides linear and exponential). This often makes it impossible to choose a sufficiently accurate model of the process under study or to obtain forecasts close to reality. In addition, when using the TREND and GROWTH functions, the equations of the trend lines are not known.

It should be noted that the authors did not set the goal of the article to present the course of regression analysis with varying degrees of completeness. Its main task is to show the capabilities of the Excel package in solving approximation problems using specific examples; demonstrate what effective tools Excel has for building regressions and forecasting; illustrate how relatively easily such problems can be solved even by a user who does not have deep knowledge of regression analysis.

Examples of solving specific problems

Consider the solution of specific problems using the listed tools of the Excel package.

Task 1

Given a table of data on the profit of a motor transport enterprise for 1995-2002, do the following.

    Build a chart.

    Add linear and polynomial (quadratic and cubic) trend lines to the chart.

    Using the trend line equations, obtain tabular data on the profit of the enterprise for each trend line for 1995-2004.

    Make a profit forecast for the enterprise for 2003 and 2004.

The solution of the problem

    In the range of cells A4:C11 of the Excel worksheet, we enter the worksheet shown in Fig. 4.

    Having selected the range of cells B4:C11, we build a chart.

    We activate the constructed chart and, using the method described above, after selecting the type of trend line in the Trendline dialog box (see Fig. 1), we add linear, quadratic and cubic trend lines to the chart in turn. In the same dialog box, open the Parameters tab (see Fig. 2), enter the name of the trend being added in the Name of the approximating (smoothed) curve field, and set the value 2 in the Forecast forward for: periods field, since a profit forecast for two years ahead is planned. To display the regression equation and the value of the approximation reliability R^2 in the chart area, enable the checkboxes show the equation on the chart and put the value of the approximation reliability (R^2) on the diagram. For better visual perception, we change the type, color and thickness of the plotted trend lines using the View tab of the Format Trendline dialog box (see Fig. 3). The resulting chart with added trend lines is shown in Fig. 5.

    To obtain tabular data on the profit of the enterprise for each trend line for 1995-2004, we use the equations of the trend lines presented in Fig. 5. To do this, in the cells of the range D3:F3, enter the names of the selected trend line types: Linear trend, Quadratic trend, Cubic trend. Next, enter the linear regression formula in cell D4 and, using the fill handle, copy this formula with relative references to the range of cells D5:D13. Each cell with a linear regression formula in the range D4:D13 takes the corresponding cell from the range A4:A13 as its argument. The range E4:E13 is filled in the same way for the quadratic regression, and the range F4:F13 for the cubic regression. This gives a profit forecast for the enterprise for 2003 and 2004 from the three trends. The resulting table of values is shown in Fig. 6.

Task 2

    Build a chart.

    Add logarithmic, power and exponential trend lines to the chart.

    Derive the equations of the obtained trend lines, as well as the values of the approximation reliability R^2 for each of them.

    Using the trend line equations, obtain tabular data on the profit of the enterprise for each trend line for 1995-2002.

    Make a profit forecast for the business for 2003 and 2004 using these trend lines.

The solution of the problem

Following the methodology given in solving problem 1, we obtain a diagram with added logarithmic, power and exponential trend lines (Fig. 7). Further, using the obtained trend line equations, we fill in the table of values for the profit of the enterprise, including the predicted values for 2003 and 2004 (Fig. 8).

On fig. 5 and fig. it can be seen that the model with a logarithmic trend corresponds to the lowest value of the approximation reliability

R2 = 0.8659

The highest values ​​of R2 correspond to models with a polynomial trend: quadratic (R2 = 0.9263) and cubic (R2 = 0.933).
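Each of the three trend types in this task can be reduced to a straight-line fit by transforming the data first. The sketch below assumes the usual forms y = a*ln(x) + b (logarithmic), y = c*x^p (power), and y = c*e^(k*x) (exponential), which are not spelled out in the text. The function names are hypothetical, and it is an assumption that the spreadsheet computes its trend lines in the same way.

```python
import math


def linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_logarithmic(xs, ys):
    """y = a*ln(x) + b: regress y on ln(x). Returns (a, b). Requires x > 0."""
    return linear_fit([math.log(x) for x in xs], ys)


def fit_power(xs, ys):
    """y = c*x**p: regress ln(y) on ln(x). Returns (c, p). Requires x, y > 0."""
    p, ln_c = linear_fit([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(ln_c), p


def fit_exponential(xs, ys):
    """y = c*exp(k*x): regress ln(y) on x. Returns (c, k). Requires y > 0."""
    k, ln_c = linear_fit(xs, [math.log(y) for y in ys])
    return math.exp(ln_c), k


periods = list(range(1, 9))  # period numbers, must be positive for ln(x)
profit = [10.2, 11.5, 13.1, 13.8, 15.9, 17.2, 18.0, 20.3]  # made-up values

print("logarithmic (a, b):", fit_logarithmic(periods, profit))
print("power       (c, p):", fit_power(periods, profit))
print("exponential (c, k):", fit_exponential(periods, profit))
```

Because the power and exponential fits minimize squared errors of ln(y) rather than of y, their parameters can differ slightly from a direct nonlinear least-squares fit.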

Task 3

Using the table of data on the profit of a motor transport enterprise for 1995-2002 given in Task 1, perform the following steps.

    Get data series for linear and exponential trend lines using the TREND and GROWTH functions.

    Using the TREND and GROWTH functions, make a profit forecast for the enterprise for 2003 and 2004.

    For the initial data and the received data series, construct a diagram.

The solution of the problem

Let's use the worksheet of task 1 (see Fig. 4). Let's start with the TREND function:

    select the range of cells D4:D11, which is to be filled with the values of the TREND function corresponding to the known data on the profit of the enterprise;

    call the Function command from the Insert menu. In the Function Wizard dialog box that appears, select the TREND function from the Statistical category, and then click OK. The same can be done by clicking the Insert Function button on the standard toolbar;

    In the Function Arguments dialog box that appears, enter the range of cells C4:C11 in the Known_values_y field; in the Known_values_x field - the range of cells B4:B11;

    to make the entered formula an array formula, use the key combination Ctrl + Shift + Enter.

The formula we entered will appear in the formula bar as: {=TREND(C4:C11;B4:B11)}.

As a result, the range of cells D4:D11 is filled with the corresponding values of the TREND function (Fig. 9).

To forecast the company's profit for 2003 and 2004:

    select the range of cells D12:D13, where the values predicted by the TREND function will be entered;

    call the TREND function and, in the Function Arguments dialog box that appears, enter the range of cells C4:C11 in the Known_values_y field, the range of cells B4:B11 in the Known_values_x field, and the range of cells B12:B13 in the New_values_x field;

    turn this formula into an array formula using the keyboard shortcut Ctrl + Shift + Enter.

    The entered formula will look like: {=TREND(C4:C11;B4:B11;B12:B13)}, and the range of cells D12:D13 will be filled with the predicted values of the TREND function (see Fig. 9).

Similarly, a data series is filled using the GROWTH function, which is used in the analysis of nonlinear dependencies and works exactly the same as its linear counterpart TREND.
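What the two functions compute can be sketched in a few lines of Python. This is an illustration of the underlying least-squares idea, not the spreadsheet's code. The names `trend` and `growth` mirror the worksheet functions only for readability, and it is an assumption that GROWTH fits the curve y = b*m^x by regressing ln(y) on x.

```python
import math


def trend(known_y, known_x, new_x=None):
    """Values of the least-squares line y = m*x + b at new_x
    (at known_x when new_x is omitted)."""
    n = len(known_x)
    mean_x = sum(known_x) / n
    mean_y = sum(known_y) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(known_x, known_y))
         / sum((x - mean_x) ** 2 for x in known_x))
    b = mean_y - m * mean_x
    targets = known_x if new_x is None else new_x
    return [m * x + b for x in targets]


def growth(known_y, known_x, new_x=None):
    """Values of the curve y = b * m**x, found by fitting a line to ln(y)."""
    logs = [math.log(y) for y in known_y]
    return [math.exp(v) for v in trend(logs, known_x, new_x)]


years = list(range(1995, 2003))  # plays the role of B4:B11
profit = [10.2, 11.5, 13.1, 13.8, 15.9, 17.2, 18.0, 20.3]  # made-up C4:C11

print("TREND fitted:   ", [round(v, 2) for v in trend(profit, years)])
print("TREND forecast: ", [round(v, 2) for v in trend(profit, years, [2003, 2004])])
print("GROWTH forecast:", [round(v, 2) for v in growth(profit, years, [2003, 2004])])
```

Centering x and y around their means before computing the slope keeps the arithmetic stable even when x holds calendar years.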

Figure 10 shows the table in formula display mode.

For the initial data and the obtained data series, a chart is built, shown in Fig. 11.

Task 4

Using a table of data on the receipt of applications for services by the dispatching service of a motor transport enterprise for the period from the 1st to the 11th day of the current month, perform the following actions.

    Obtain data series for linear regression: using the SLOPE and INTERCEPT functions; using the LINEST function.

    Obtain a data series for exponential regression using the LOGEST function.

    Using the above functions, make a forecast about the receipt of applications to the dispatch service for the period from the 12th to the 14th day of the current month.

    For the original and received data series, construct a diagram.

The solution of the problem

Note that, unlike the TREND and GROWTH functions, none of the functions listed above (SLOPE, INTERCEPT, LINEST, LOGEST) is itself a regression. These functions play only an auxiliary role, determining the necessary regression parameters.

For linear and exponential regressions built using the SLOPE, INTERCEPT, LINEST, and LOGEST functions, the form of the equation is always known, in contrast to the linear and exponential regressions produced by the TREND and GROWTH functions.

1. Let's build a linear regression that has the equation:

y=mx+b

using the SLOPE and INTERCEPT functions, with the slope of the regression m being determined by the SLOPE function, and the constant term b - by the INTERCEPT function.

To do this, we perform the following actions:

    enter the source table in the range of cells A4:B14;

    the value of the parameter m will be determined in cell C19. Select the SLOPE function from the Statistical category; enter the range of cells B4:B14 in the known_values_y field and the range of cells A4:A14 in the known_values_x field. The formula will be entered into cell C19: =SLOPE(B4:B14;A4:A14);

    the value of the parameter b is determined in cell D19 in the same way, and its content will look like this: =INTERCEPT(B4:B14;A4:A14). Thus, the values of the parameters m and b needed to construct the linear regression will be stored in cells C19 and D19, respectively;

    then we enter the linear regression formula in cell C4 in the form: =$C$19*A4+$D$19. In this formula, cells C19 and D19 are written with absolute references (the cell address must not change when the formula is copied). The absolute reference sign $ can be typed from the keyboard or inserted with the F4 key after placing the cursor on the cell address. Using the fill handle, copy this formula to the range of cells C4:C17. We get the desired data series (Fig. 12). Since the number of requests is an integer, set the number format on the Number tab of the Cell Format window with the number of decimal places set to 0.

2. Now let's build a linear regression given by the equation:

y=mx+b

using the LINEST function.

For this:

    enter the LINEST function as an array formula into the range of cells C20:D20: {=LINEST(B4:B14;A4:A14)}. As a result, we get the value of the parameter m in cell C20 and the value of the parameter b in cell D20;

    enter the formula in cell D4: =$C$20*A4+$D$20;

    copy this formula using the fill handle to the range of cells D4:D17 and get the desired data series.
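The parameters m and b found in steps 1 and 2 are the ordinary least-squares slope and intercept, so both approaches should give the same pair of numbers. Here is a minimal Python sketch of that calculation and of the resulting data series. The request counts are made up, and the function names `slope` and `intercept` are hypothetical stand-ins that only borrow the argument order of the worksheet functions.

```python
def slope(known_y, known_x):
    """m = sum((x - mean_x)*(y - mean_y)) / sum((x - mean_x)^2)."""
    n = len(known_x)
    mean_x = sum(known_x) / n
    mean_y = sum(known_y) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_x, known_y))
    sxx = sum((x - mean_x) ** 2 for x in known_x)
    return sxy / sxx


def intercept(known_y, known_x):
    """b = mean_y - m * mean_x, so the line passes through the mean point."""
    n = len(known_x)
    mean_x = sum(known_x) / n
    mean_y = sum(known_y) / n
    return mean_y - slope(known_y, known_x) * mean_x


days = list(range(1, 12))  # days 1..11, the role of A4:A14
requests = [11, 13, 12, 15, 16, 18, 17, 20, 22, 21, 24]  # made-up B4:B14

m = slope(requests, days)          # the role of cell C19 (or C20)
b = intercept(requests, days)      # the role of cell D19 (or D20)

# Data series for days 1..14: fitted values plus the forecast for days 12-14,
# rounded because the number of requests is an integer.
series = [round(m * x + b) for x in range(1, 15)]
print("m =", round(m, 4), "b =", round(b, 4))
print(series)
```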

3. We build an exponential regression that has the equation:

y=b*m^x

using the LOGEST function in a similar way:

    in the range of cells C21:D21, enter the LOGEST function as an array formula: {=LOGEST(B4:B14;A4:A14)}. The value of the parameter m will be determined in cell C21, and the value of the parameter b in cell D21;

    enter the formula in cell E4: =$D$21*$C$21^A4;

    using the fill handle, copy this formula to the range of cells E4:E17, which will hold the data series for exponential regression (see Fig. 12).
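The exponential parameters can be recovered with the same straight-line machinery, because taking logarithms of y = b*m^x gives ln(y) = x*ln(m) + ln(b). The Python sketch below assumes this log-linear approach. The function name `logest` is a hypothetical stand-in for the worksheet function, and the data are made up.

```python
import math


def logest(known_y, known_x):
    """Return (m, b) of y = b * m**x by fitting a line to ln(y). Requires y > 0."""
    n = len(known_x)
    logs = [math.log(y) for y in known_y]
    mean_x = sum(known_x) / n
    mean_l = sum(logs) / n
    ln_m = (sum((x - mean_x) * (l - mean_l) for x, l in zip(known_x, logs))
            / sum((x - mean_x) ** 2 for x in known_x))
    ln_b = mean_l - ln_m * mean_x
    return math.exp(ln_m), math.exp(ln_b)


days = list(range(1, 12))  # days 1..11
requests = [11, 13, 12, 15, 16, 18, 17, 20, 22, 21, 24]  # made-up values

m, b = logest(requests, days)      # the roles of cells C21 and D21
series = [round(b * m ** x) for x in range(1, 15)]  # includes days 12-14
print("m =", round(m, 4), "b =", round(b, 4))
print(series)
```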

Fig. 13 shows a table with the functions we use, the required cell ranges, and the formulas.

The value R^2 is called the coefficient of determination.

The task of constructing a regression dependence is to find the vector of coefficients m of model (1) for which the coefficient R takes its maximum value.

To assess the significance of R, Fisher's F-test is used, calculated by the formula

where n is the sample size (number of experiments);

k is the number of model coefficients.

If F exceeds some critical value for the given n and k and the accepted confidence level, then the value of R is considered significant. Tables of critical values of F are given in reference books on mathematical statistics.

Thus, the significance of R is determined not only by its value, but also by the ratio between the number of experiments and the number of coefficients (parameters) of the model. Indeed, for a simple linear model with n=2 the correlation ratio is 1 (a straight line can always be drawn through two points on a plane). However, if the experimental data are random variables, such a value of R should be treated with great caution. Usually, to obtain a significant R and a reliable regression, one aims for the number of experiments to significantly exceed the number of model coefficients (n>k).
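The dependence of significance on n and k can be illustrated numerically. The sketch below uses one common form of the statistic, F = (R^2/(k-1)) / ((1-R^2)/(n-k)), under the assumption that k counts every model coefficient including the constant term; the text's own formula may follow a different convention. The function name `f_statistic` is hypothetical.

```python
import math


def f_statistic(r2, n, k):
    """F = (R^2 / (k - 1)) / ((1 - R^2) / (n - k)).

    Assumption: k is the number of model coefficients including the
    constant term, so a straight line y = m*x + b has k = 2.
    """
    if k < 2 or n <= k:
        raise ValueError("need k >= 2 and more experiments than coefficients")
    if r2 >= 1:
        return math.inf  # perfect fit: the statistic is unbounded
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))


# The same R^2 is far more convincing with many experiments than with few.
for n in (4, 12, 50):
    print("n =", n, "F =", round(f_statistic(0.9, n, 2), 2))
```

The computed F is then compared with the tabulated critical value for k-1 and n-k degrees of freedom at the chosen confidence level.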

To build a linear regression model, you must:

1) prepare a list of n rows and m columns containing the experimental data (the column containing the output value Y must be either the first or the last in the list); for example, take the data from the previous task and add a column called "period number", numbering the periods from 1 to 12 (these will be the X values);

2) go to the menu Tools/Data Analysis/Regression

If the "Data Analysis" item in the "Tools" menu is missing, go to the "Add-Ins" item of the same menu and check the "Analysis ToolPak" box.

3) in the "Regression" dialog box, set:

input interval Y;

input interval X;

output interval - the upper left cell of the interval in which the calculation results will be placed (it is recommended to place it on a new worksheet);

4) click "OK" and analyze the results.