Restrictions of the criterion. Pearson's chi-square test


Statistical tests for contingency tables - the chi-square test

To obtain statistical tests for crosstabs, click the Statistics... button in the Crosstabs dialog box. The Crosstabs: Statistics dialog box will open (see Figure 11.9).

Fig. 11.9: The Crosstabs: Statistics dialog box

The checkboxes in this dialog box allow you to select one or more criteria.

    Chi-square test (χ²)

    Correlations

    Measures of association for variables on the nominal scale

    Measures of association for variables on the ordinal scale

    Measures of association for variables on the interval scale

    Kappa coefficient (κ)

    Risk measure

    McNemar test

    Cochran and Mantel-Haenszel statistics

These tests are discussed in the next two sections; because the chi-square test is of great importance in statistical computing, a separate section is devoted to it.

Chi-square test (χ²)

The chi-square test checks the mutual independence of the two variables of a contingency table and thereby indirectly reveals whether the variables are dependent. Two variables are considered mutually independent if the observed frequencies (f_o) in the cells match the expected frequencies (f_e).

To run a chi-square test with SPSS, follow these steps:

    From the command menu, select Analyze > Descriptive Statistics > Crosstabs...

    Use the Reset button to clear any previous settings.

    Move the variable sex into the row list and the variable psyche into the column list.

    Click the Cells... button. In the dialog box, in addition to the Observed checkbox (selected by default), check the Expected and Standardized checkboxes. Confirm your choice with the Continue button.

    Click the Statistics... button. The Crosstabs: Statistics dialog box described above opens.

    Check the Chi-square checkbox. Click the Continue button, and in the main dialog box, click OK.

You will receive the following contingency table.

Gender * Mental state cross-tabulation

                                 Extremely                          Very
                                 unstable   Unstable   Stable    stable   Total
Gender   Female   Count               16         18        9         1      44
                  Expected count     7.9       16.6     17.0       2.5    44.0
                  Std. residual      2.9         .3     -1.9       -.9
         Male     Count                3         22       32         5      62
                  Expected count    11.1       23.4     24.0       3.5    62.0
                  Std. residual     -2.4        -.3      1.6        .8
Total             Count               19         40       41         6     106
                  Expected count    19.0       40.0     41.0       6.0   106.0

In addition, the results of the chi-square test will be shown in the viewer window:

Chi-Square Tests

                               Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square             22.455(a)    3        .000
Likelihood Ratio               23.688       3        .000
Linear-by-Linear Association   20.391       1        .000
N of Valid Cases              106

a. 2 cells (25.0%) have expected count less than 5. The minimum expected count is 2.49.

Three different approaches are used to calculate the chi-square statistic:

  • Pearson's formula;
  • the likelihood-ratio correction;
  • the Mantel-Haenszel test.

In addition, if the contingency table has four cells (a 2 x 2 table) and an expected frequency is less than 5, Fisher's exact test is calculated.

Usually, Pearson's formula is used to calculate the chi-square statistic:

χ² = Σ (f_o - f_e)² / f_e,

where the sum is taken over all cells of the contingency table. It is thus the sum of the squares of the standardized residuals over all cells of the table. Therefore, cells with a larger standardized residual contribute more to the chi-square value and hence to a significant result. According to the rule given in section 8.9, a standardized residual of 2 (more precisely, 1.96) or more indicates a significant discrepancy between the observed and expected frequencies in a particular table cell.

In this example, Pearson's formula gives a highly significant chi-square value (p < 0.0001). If we look at the standardized residuals in the individual cells of the contingency table, then, based on the above rule, we can conclude that this significance is mainly due to the cells in which the variable psyche has the value "extremely unstable". For women this frequency is strongly elevated, and for men it is lowered.

The correctness of the chi-square test is determined by two conditions:

  • expected frequencies less than 5 must occur in no more than 20% of the table's cells;
  • the row and column sums must always be greater than zero.

However, in the example under consideration this condition is not fully satisfied. As the note after the chi-square test table indicates, 25% of the cells have an expected frequency of less than 5. However, since the allowable limit of 20% is only slightly exceeded, and these cells, due to their very small standardized residuals, contribute only a very small share to the chi-square value, this violation can be considered insignificant.

An alternative to Pearson's formula for computing the chi-square statistic is the likelihood-ratio correction:

χ² = 2 · Σ f_o · ln(f_o / f_e)

With a large sample size, Pearson's formula and the corrected formula give very similar results. In our example, the likelihood-ratio chi-square statistic is 23.688.
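This computation is easy to reproduce outside SPSS. Below is a minimal Python sketch (assuming scipy is available; the counts are taken from the contingency table above) that yields the Pearson statistic, the likelihood-ratio statistic, and the standardized residuals:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[16, 18,  9, 1],    # female
                     [ 3, 22, 32, 5]])   # male

# Pearson chi-square
chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(f"Pearson chi-square: {chi2:.3f}, df={df}, p={p:.4f}")  # ~22.455, df=3

# Likelihood-ratio statistic
g, p_g, _, _ = chi2_contingency(observed, correction=False,
                                lambda_="log-likelihood")
print(f"Likelihood ratio:   {g:.3f}, p={p_g:.4f}")            # ~23.688

# Standardized residuals: (observed - expected) / sqrt(expected)
print(np.round((observed - expected) / np.sqrt(expected), 1))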

The use of this criterion is based on a measure (statistic) of the discrepancy between the theoretical distribution F(x) and the empirical distribution F*(x) which approximately obeys the χ² distribution law. The hypothesis H0 about the consistency of the distributions is tested by analyzing the distribution of this statistic. Applying the criterion requires the construction of a statistical series.

So, let the sample be represented by a statistical series with M bins. The observed frequency of hits in the i-th bin is n_i; in accordance with the theoretical distribution law, the expected frequency of hits in the i-th bin is F_i. The difference between the observed and expected frequency is (n_i - F_i). To find the overall degree of discrepancy between F(x) and F*(x), one calculates the weighted sum of the squared differences over all bins of the statistical series:

χ² = Σ (n_i - F_i)² / F_i, (3.7)

where the sum runs over i = 1, ..., M.

As n grows without bound, the value of χ² asymptotically follows the χ² distribution. This distribution depends on the number of degrees of freedom k, i.e. the number of independent terms in expression (3.7). The number of degrees of freedom equals the number of bins M minus the number of linear constraints imposed on the sample. One constraint exists because any single frequency can be computed from the set of frequencies in the remaining M - 1 bins. In addition, if the distribution parameters are not known in advance, there is a further constraint for each parameter fitted from the sample. If S distribution parameters are determined from the sample, the number of degrees of freedom is k = M - S - 1.

The acceptance region of the hypothesis H0 is determined by the condition χ² < χ²(k; α), where χ²(k; α) is the critical point of the χ² distribution at significance level α. The probability of a Type I error is α; the probability of a Type II error cannot be clearly defined, because there are infinitely many ways in which distributions can disagree. The power of the test depends on the number of bins and on the sample size. The criterion is recommended for n > 200; its application is allowed for n > 40, and it is under such conditions that the criterion is consistent (as a rule, it rejects an incorrect null hypothesis).

Algorithm for applying the criterion

1. Construct a histogram using the equiprobable method.

2. Based on the shape of the histogram, put forward the hypotheses

H0: f(x) = f0(x),

H1: f(x) ≠ f0(x),

where f0(x) is the probability density of the hypothesized distribution law (for example, uniform, exponential, or normal).

Comment. The hypothesis of an exponential distribution law can be put forward if all the numbers in the sample are positive.

3. Calculate the value of the criterion using the formula

χ² = Σ (n_i - n·p_i)² / (n·p_i),

where
n_i is the frequency of hits in the i-th interval;

p_i is the theoretical probability of the random variable falling into the i-th interval, provided that hypothesis H0 is true.

The formulas for calculating p_i in the case of the exponential, uniform, and normal laws, respectively, are as follows.

Exponential law

p_i = exp(-λ·A_i) - exp(-λ·B_i), (3.8)

where A_i and B_i are the boundaries of the i-th interval and λ is the parameter estimate; here A_1 = 0, B_M = +∞.

Uniform law

p_i = (B_i - A_i) / (b* - a*), (3.9)

where a* and b* are the estimated boundaries of the uniform law.

Normal law

p_i = 0.5·(Φ((B_i - m*)/σ*) - Φ((A_i - m*)/σ*)), (3.10)

where m* and σ* are the estimates of the mean and the standard deviation; here A_1 = -∞, B_M = +∞.

Remarks. After calculating all the probabilities p_i, check that the control relation is satisfied:

p_1 + p_2 + ... + p_M = 1.

The function Φ(x) is odd; Φ(+∞) = 1.

4. From the "Chi-square" table of the Appendix, select the value χ²(k; α), where α is the given significance level (α = 0.05 or α = 0.01) and k is the number of degrees of freedom, determined by the formula

k = M - 1 - S.

Here S is the number of parameters on which the hypothesized distribution law H0 depends. The value of S is 2 for the uniform law, 1 for the exponential, and 2 for the normal.

5. If χ² > χ²(k; α), then the hypothesis H0 is rejected. Otherwise, there is no reason to reject it: with probability 1 - β it is true, and with probability β it is false, but the value of β is unknown.
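To make the procedure concrete, here is a small Python sketch of steps 3 to 5 (a hypothetical helper, assuming the theoretical probabilities p_i have already been computed by formulas (3.8) to (3.10)):

from scipy.stats import chi2

def chi_square_gof(freqs, probs, n_params, alpha=0.05):
    """freqs: observed counts n_i per bin; probs: theoretical p_i;
    n_params: S, the number of estimated distribution parameters."""
    n = sum(freqs)
    assert abs(sum(probs) - 1.0) < 1e-6   # control relation: sum of p_i is 1
    stat = sum((ni - n * pi) ** 2 / (n * pi) for ni, pi in zip(freqs, probs))
    k = len(freqs) - 1 - n_params         # degrees of freedom, k = M - 1 - S
    crit = chi2.ppf(1 - alpha, k)         # table value chi2(k; alpha)
    return stat, crit, stat < crit        # True -> no reason to reject H0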

Example 3.1. Using the criterion χ², put forward and test a hypothesis about the distribution law of the random variable X, whose variation series, interval tables and distribution histograms are given in example 1.2. The significance level α is 0.05.

Solution. Based on the shape of the histograms, we hypothesize that the random variable X is distributed according to the normal law:

H0: f(x) = N(m, σ);

H1: f(x) ≠ N(m, σ).

The criterion value is calculated by the formula

χ² = n · Σ (n_i/n - p_i)² / p_i. (3.11)

As noted above, when testing a hypothesis it is preferable to use an equiprobable histogram. In this case the observed frequencies are the same for all intervals: n_i = n/M = 100/10 = 10.

We calculate the theoretical probabilities p_i by formula (3.10), taking as the parameter estimates m* = -1.7 and σ* = 1.98:

p_1 = 0.5(Φ((-4.5245 + 1.7)/1.98) - Φ((-∞ + 1.7)/1.98)) = 0.5(Φ(-1.427) - Φ(-∞)) = 0.5(-0.845 + 1) = 0.078;

p_2 = 0.5(Φ((-3.8865 + 1.7)/1.98) - Φ((-4.5245 + 1.7)/1.98)) = 0.5(Φ(-1.104) + 0.845) = 0.5(-0.729 + 0.845) = 0.058;

p_3 = 0.094; p_4 = 0.135; p_5 = 0.118; p_6 = 0.097; p_7 = 0.073; p_8 = 0.059; p_9 = 0.174;

p_10 = 0.5(Φ((+∞ + 1.7)/1.98) - Φ((0.6932 + 1.7)/1.98)) = 0.114.

After that, we check that the control relation is satisfied: the probabilities sum to one, p_1 + p_2 + ... + p_10 = 1.000. The criterion value by formula (3.11) is

χ² = 100 × (0.0062 + 0.0304 + 0.0004 + 0.0091 + 0.0028 + 0.0001 + 0.0100 + 0.0285 + 0.0315 + 0.0017) = 100 × 0.1207 = 12.07.

After that, from the "Chi-square" table we select the critical value for k = M - 1 - S = 10 - 1 - 2 = 7 degrees of freedom:

χ²(7; 0.05) ≈ 14.07.

Because 12.07 < 14.07, the hypothesis H0 is accepted (there is no reason to reject it).
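The arithmetic of this example can be verified numerically. Here is a short Python sketch (it assumes, as in the text, the parameter estimates m = -1.7 and σ = 1.98; only the interval boundaries actually quoted above are reproduced, and the standard normal CDF replaces the odd function Φ via cdf(x) = 0.5(Φ(x) + 1)):

from scipy.stats import norm, chi2

m, s, n, M = -1.7, 1.98, 100, 10
print(norm.cdf((-4.5245 - m) / s))                        # ~0.078 (= p_1)
print(norm.cdf((-3.8865 - m) / s)
      - norm.cdf((-4.5245 - m) / s))                      # ~0.058 (= p_2)
print(1 - norm.cdf((0.6932 - m) / s))                     # ~0.114 (= p_10)

p = [0.078, 0.058, 0.094, 0.135, 0.118, 0.097, 0.073, 0.059, 0.174, 0.114]
stat = sum((n / M - n * pi) ** 2 / (n * pi) for pi in p)  # equiprobable: n_i = 10
print(stat, chi2.ppf(0.95, M - 1 - 2))                    # ~12.07 < ~14.07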

The quantitative study of biological phenomena necessarily requires the creation of hypotheses that can explain these phenomena. To test a particular hypothesis, a series of special experiments is set up and the actual data obtained are compared with those theoretically expected under this hypothesis. If they match, this may be sufficient reason to accept the hypothesis. If the experimental data agree poorly with the theoretically expected, great doubt arises as to the correctness of the proposed hypothesis.

The degree of agreement of the actual data with the expected (hypothetical) data is measured by the chi-square goodness-of-fit test:

χ² = Σ (O_i - E_i)² / E_i,

where O_i is the actually observed value of the feature in the i-th group, E_i is the theoretically expected number for that group, and k is the number of data groups (the sum runs over i = 1, ..., k).

The criterion was proposed by K. Pearson in 1900 and is sometimes called Pearson's criterion.

A task. Among 164 children, each of whom inherited one factor from one parent and the other factor from the other, there were 46 children with the first factor, 50 with the second, and 68 with both. Calculate the expected frequencies for a 1:2:1 ratio between the groups and determine the degree of agreement of the empirical data using Pearson's test.

Solution: The ratio of observed frequencies is 46:68:50, theoretically expected 41:82:41.

Let's set the significance level to 0.05. The computed value of the criterion is

χ² = (46 - 41)²/41 + (68 - 82)²/82 + (50 - 41)²/41 ≈ 4.98,

while the tabular value of the Pearson test for this significance level with k = 3 - 1 = 2 degrees of freedom is 5.99. Therefore, the hypothesis that the experimental data correspond to the theoretical ones can be accepted, since 4.98 < 5.99.
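The same check takes a few lines in Python (a sketch using scipy; the group order "first factor, both, second factor" matches the 46:68:50 notation above):

from scipy.stats import chisquare, chi2

obs = [46, 68, 50]
exp = [41, 82, 41]             # 164 split in the ratio 1:2:1
stat, p = chisquare(obs, f_exp=exp)
print(stat, p)                 # ~4.98, p > 0.05
print(chi2.ppf(0.95, df=2))    # ~5.99, and 4.98 < 5.99, so H0 is accepted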

Note that when calculating the chi-square test we no longer impose the condition that the distribution must be normal. The chi-square test can be used for any distributions that we are free to choose in our assumptions. There is a certain universality in this criterion.

Another application of Pearson's criterion is the comparison of an empirical distribution with the Gaussian normal distribution; in this role it belongs to the group of tests for checking the normality of a distribution. The only limitation is that the total number of values (variants) must be large enough (at least 40), and the number of values in individual classes (intervals) must be at least 5; otherwise, adjacent intervals should be combined. When checking the normality of a distribution, the number of degrees of freedom is calculated as k = M - 3, where M is the number of classes (two parameters of the normal law are estimated from the sample).

    1. Fisher's criterion.

This parametric test serves to test the null hypothesis about the equality of the variances of normally distributed populations.

H0: σ₁² = σ₂², or, equivalently, σ₁²/σ₂² = 1.

For small sample sizes, the application of Student's t-test can be correct only if the variances are equal. Therefore, before testing the equality of sample means, it is necessary to make sure that Student's t-test is applicable. The test statistic is the ratio of the larger sample variance to the smaller one:

F = s₁² / s₂², where s₁² ≥ s₂²,

and N₁, N₂ are the sample sizes, while ν₁ = N₁ - 1, ν₂ = N₂ - 1 are the numbers of degrees of freedom for these samples.

When using the tables, note that the number of degrees of freedom for the sample with the larger variance is chosen as the column number of the table, and that for the smaller variance as the row number.

For the significance level α, we find the tabular value F(α; ν₁, ν₂) in the tables of mathematical statistics. If F > F(α; ν₁, ν₂), then the hypothesis of equality of variances is rejected at the chosen significance level.
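For illustration, here is a minimal Python sketch of this variance-ratio test (the function name is ours; scipy provides the F distribution itself but no ready-made two-variance test):

import numpy as np
from scipy.stats import f

def f_test_variances(x, y, alpha=0.05):
    x, y = np.asarray(x, float), np.asarray(y, float)
    v1, v2 = x.var(ddof=1), y.var(ddof=1)
    if v1 < v2:                        # larger variance goes in the numerator
        x, y, v1, v2 = y, x, v2, v1
    F = v1 / v2
    df1, df2 = len(x) - 1, len(y) - 1  # degrees of freedom nu_1, nu_2
    F_crit = f.ppf(1 - alpha, df1, df2)
    return F, F_crit, F > F_crit       # True -> reject equality of variances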

Example. The effect of cobalt on the body weight of rabbits was studied. The experiment was carried out on two groups of animals: experimental and control. The experimental group received a dietary supplement in the form of an aqueous solution of cobalt chloride. During the experiment, the weight gains in grams were:

Control

Purpose of the χ² (Pearson) criterion. The χ² criterion is used for two purposes: 1) to compare the empirical distribution of a feature with a theoretical one (uniform, normal, or some other); 2) to compare two, three or more empirical distributions of the same feature.

Description of the criterion. The χ² criterion answers the question of whether different values of a feature occur with equal frequency in the empirical and theoretical distributions, or in two or more empirical distributions. The advantage of the method is that it allows comparing distributions of features presented in any scale, starting with the nominal scale. Already in the simplest case of an alternative distribution ("yes - no", "produced a defect - did not produce a defect", "solved the problem - did not solve the problem", etc.) the χ² criterion can be applied. The greater the discrepancy between the two compared distributions, the greater the empirical value of χ².

The advantage of the Pearson criterion is its universality: it can be used to test hypotheses about various distribution laws.

1. Testing the hypothesis of a normal distribution.

Let a sample of sufficiently large size n be obtained, with many different values of the variants. For the convenience of processing, we divide the interval from the smallest to the largest value of the variants into s equal parts, and we will assume that the values of the variants falling into each interval are approximately equal to the number specifying the middle of the interval. Counting the number of variants that fall into each interval, we form the so-called grouped sample:

variants........... x_1 x_2 ... x_s

frequencies........ n_1 n_2 ... n_s,

where x_i are the values of the midpoints of the intervals and n_i is the number of variants falling into the i-th interval (the empirical frequencies).



From the data obtained one can calculate the sample mean x̄_B and the sample standard deviation σ_B. Let us check the assumption that the general population is distributed according to the normal law with parameters M(X) = x̄_B, D(X) = σ_B². Then we can find the number of values from the sample of size n that should fall into each interval under this assumption (the theoretical frequencies). To do this, using the table of values of the Laplace function Φ, we find the probability of falling into the i-th interval:

p_i = Φ((b_i - x̄_B)/σ_B) - Φ((a_i - x̄_B)/σ_B),

where a_i and b_i are the boundaries of the i-th interval. Multiplying the resulting probabilities by the sample size n, we find the theoretical frequencies: n_i′ = n·p_i. Our goal is to compare the empirical and theoretical frequencies, which, of course, differ from each other, and to find out whether these differences are insignificant and do not disprove the hypothesis that the random variable under study is normally distributed, or whether they are so large that they contradict this hypothesis. For this, a criterion in the form of a random variable is used:

χ² = Σ (n_i - n·p_i)² / (n·p_i) (summing over i = 1, ..., s). (20.1)

Its meaning is clear: the summands are the squared deviations of the empirical frequencies from the theoretical ones, divided by the corresponding theoretical frequencies. It can be proved that, regardless of the actual distribution law of the general population, the distribution law of the random variable (20.1) tends, as n → ∞, to the χ² distribution (see lecture 12) with the number of degrees of freedom k = s - 1 - r, where r is the number of parameters of the assumed distribution estimated from the sample data. The normal distribution is characterized by two parameters, so k = s - 3. For the chosen criterion, a right-sided critical region is constructed, determined by the condition

P(χ² > χ²_cr(α; k)) = α, (20.2)

where α is the significance level. Thus, the critical region is given by the inequality χ² > χ²_cr(α; k), and the acceptance region of the hypothesis by χ² < χ²_cr(α; k).

So, to test the null hypothesis H0 (the population is normally distributed), one needs to calculate the observed value of the criterion from the sample:

χ²_obs = Σ (n_i - n·p_i)² / (n·p_i), (20.1`)

and from the table of critical points of the χ² distribution find the critical point χ²_cr(α; k), using the known values of α and k = s - 3. If χ²_obs < χ²_cr, the null hypothesis is accepted; if χ²_obs > χ²_cr, it is rejected.
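A compact Python sketch of the whole procedure (illustrative names; equal-width intervals, parameters estimated from the sample, k = s - 3):

import numpy as np
from scipy.stats import norm, chi2

def normal_gof(sample, s=10, alpha=0.05):
    sample = np.asarray(sample, float)
    mean, sigma = sample.mean(), sample.std()     # parameter estimates
    counts, edges = np.histogram(sample, bins=s)  # empirical frequencies n_i
    z = (edges - mean) / sigma
    z[0], z[-1] = -np.inf, np.inf                 # stretch the outer intervals
    p = np.diff(norm.cdf(z))                      # theoretical p_i
    n = len(sample)
    stat = ((counts - n * p) ** 2 / (n * p)).sum()   # formula (20.1`)
    crit = chi2.ppf(1 - alpha, s - 3)             # k = s - 3 for the normal law
    return stat, crit, stat < crit                # True -> H0 is accepted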

2. Testing the hypothesis of uniform distribution.

When using the Pearson test to check the hypothesis that the general population is uniformly distributed, with the assumed probability density

f(x) = 1/(b - a) for x ∈ [a, b] and f(x) = 0 otherwise,

it is necessary, having computed the estimates x̄_B and σ_B from the available sample, to estimate the parameters a and b by the formulas:

a* = x̄_B - √3·σ_B,  b* = x̄_B + √3·σ_B, (20.3)

where a* and b* are the estimates of a and b. Indeed, for the uniform distribution M(X) = (a + b)/2 and σ = (b - a)/(2√3), from which one can obtain a system for determining a* and b*: (a* + b*)/2 = x̄_B, (b* - a*)/(2√3) = σ_B, whose solution is expressions (20.3).

Then, assuming that a ≈ a* and b ≈ b*, one can find the theoretical frequencies by the formulas

n_i′ = n·p_i, where p_i = (b_i - a_i)/(b* - a*)

is the probability of falling into the i-th interval (for the first interval the left boundary is taken to be a*, and for the last one the right boundary is b*). Here s is the number of intervals into which the sample is divided.

The observed value of the Pearson criterion is calculated by the formula (20.1`), and the critical value is calculated from the table, taking into account the fact that the number of degrees of freedom k = s - 3. After that, the boundaries of the critical region are determined in the same way as for testing the hypothesis of a normal distribution.

3. Testing the hypothesis about the exponential distribution.

In this case, having divided the available sample into intervals of equal length, we consider a sequence of equally spaced variants (assuming that all variants falling into the i-th interval take a value coinciding with its midpoint) and their corresponding frequencies n_i (the number of sample variants falling into the i-th interval). We calculate x̄_B from these data and take as the estimate of the parameter λ the value λ* = 1/x̄_B. Then the theoretical frequencies are calculated by the formula

n_i′ = n·p_i = n·(exp(-λ*·a_i) - exp(-λ*·b_i)),

where a_i and b_i are the boundaries of the i-th interval.

Then the observed and critical values of the Pearson criterion are compared, taking into account that the number of degrees of freedom is k = s - 2.
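A similar sketch for the exponential case (assuming A_1 = 0 and B_s = +∞ as above, λ estimated as 1/x̄_B, and k = s - 2):

import numpy as np
from scipy.stats import chi2

def exponential_gof(sample, s=10, alpha=0.05):
    sample = np.asarray(sample, float)
    lam = 1.0 / sample.mean()                     # estimate of lambda
    counts, edges = np.histogram(sample, bins=s)
    edges[0], edges[-1] = 0.0, np.inf             # A_1 = 0, B_s = +infinity
    p = np.exp(-lam * edges[:-1]) - np.exp(-lam * edges[1:])  # formula (3.8)
    n = len(sample)
    stat = ((counts - n * p) ** 2 / (n * p)).sum()
    crit = chi2.ppf(1 - alpha, s - 2)             # k = s - 2
    return stat, crit, stat < crit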

In this article we will talk about studying the relationship between features, or, if you prefer, random variables. In particular, we will analyze how to introduce a measure of dependence between features using the chi-square test and compare it with the correlation coefficient.

Why might this be needed? For example, to understand which features depend more strongly on the target variable when building credit scoring (determining the probability of a client's default). Or, as in my case, to understand which indicators should be used to program a trading robot.

Separately, I note that I use the c# language for data analysis. Perhaps all of this has already been implemented in R or Python, but using c# allows me to understand the topic in detail; besides, it is my favorite programming language.

Let's start with an absolutely simple example: create four columns in Excel using a random number generator:
X=RANDBETWEEN(-100,100)
Y=X*10+20
Z=X*X
T=RANDBETWEEN(-100,100)

As you can see, the variable Y is linearly dependent on X; the variable Z is quadratically dependent on X; the variables X and T are independent. I made this choice on purpose, because we will compare our measure of dependence with the correlation coefficient. As is well known, the correlation coefficient between two random variables equals 1 in absolute value if the most "rigid" kind of dependence between them, a linear one, holds. The correlation between two independent random variables is zero, but zero correlation does not imply independence. We will see this later with the variables X and Z.

We save the file as data.csv and begin the first estimates. First, let's calculate the correlation coefficients between the values. I did not insert the code into the article; it is on my github. We get the correlations for all possible pairs:

It can be seen that for the linearly dependent X and Y the correlation coefficient is 1. But for X and Z it is 0.01, although we set the dependence explicitly: Z = X*X. Clearly, we need a measure that "feels" dependence better. But before moving on to the chi-square test, let's look at what a contingency matrix is.

To build a contingency matrix, we break the range of values of a variable into intervals (categorize it). There are many ways to do such a partitioning, and no universal one: some divide into intervals so that the same number of values fall into each, others divide into intervals of equal length. I personally like to combine these approaches. I use this method: from the variable I subtract the estimate of its expectation, then divide the result by the estimate of the standard deviation; in other words, I center and normalize the random variable. The resulting value is multiplied by a coefficient (in this example it is equal to 1), after which everything is rounded to an integer. The output is a variable of type int, which serves as the class identifier.
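In code the categorization is short. Here is a Python sketch of the same idea (the author's utility is written in c#; this fragment is only an illustration):

import numpy as np

def categorize(values, coeff=1.0):
    """Center, normalize, scale by coeff and round to an integer class id."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.rint(z * coeff).astype(int)

x = np.random.randint(-100, 101, 3000)
print(np.unique(categorize(x)))   # the resulting class identifiers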

So let's take our features X and Z, categorize them in the way described above, and then calculate the counts and probabilities of occurrence of each class, as well as the probabilities of occurrence of pairs of features:

This is the matrix of counts. Here the rows contain the numbers of occurrences of the classes of the variable X, the columns the numbers of occurrences of the classes of the variable Z, and the cells the numbers of simultaneous occurrences of pairs of classes. For example, class 0 occurred 865 times for the variable X, 823 times for the variable Z, and the pair (0,0) never occurred. Let's convert to probabilities by dividing all values by 3000 (the total number of observations):

We obtain the contingency matrix for the categorized features. Now it is time to think about the criterion. By definition, random variables are independent if the sigma-algebras generated by them are independent. Independence of the sigma-algebras implies pairwise independence of events from them. Two events are called independent if the probability of their joint occurrence equals the product of their probabilities: P_ij = P_i·P_j. It is this formula that we will use to construct the criterion.

Null hypothesis: the categorized features X and Z are independent. Equivalently: the distribution of the contingency matrix is determined solely by the probabilities of occurrence of the classes of the variables (the row and column probabilities). Or: the cells of the matrix are the products of the corresponding row and column probabilities. We will use this formulation of the null hypothesis to construct the decision rule: a significant discrepancy between P_ij and P_i·P_j will be grounds for rejecting it.

Let p_0 be the probability of occurrence of class 0 in the variable X. In total we have n classes of X and m classes of Z. It turns out that to specify the distribution of the matrix we need to know these n and m probabilities. But in fact, if we know n - 1 probabilities for X, the last one can be found by subtracting the sum of the others from 1. Thus, to find the distribution of the contingency matrix we need to know l = (n - 1) + (m - 1) values. In other words, we have an l-dimensional parameter space, a vector from which gives us the desired distribution. Here P_ij is the empirical probability of the pair (i, j), and P_i, P_j are the row and column probabilities. The chi-square statistic will look like this:

χ² = n · Σ over i,j of (P_ij - P_i·P_j)² / (P_i·P_j)

and, according to Fisher's theorem, it has a chi-square distribution with n·m - l - 1 = (n - 1)(m - 1) degrees of freedom.

Let's set the confidence level to 0.95 (that is, the probability of a Type I error to 0.05). Let's find the quantile of the chi-square distribution for this level and the number of degrees of freedom from the example, (n - 1)(m - 1) = 4*3 = 12: it is 21.02606982. The chi-square statistic itself for the variables X and Z equals 4088.006631. It can be seen that the independence hypothesis is not accepted. It is convenient to consider the ratio of the chi-square statistic to the threshold value: in this case it equals Chi2Coeff = 194.4256186. If this ratio is less than 1, the independence hypothesis is accepted; if it is greater, it is not. Let's find this ratio for all pairs of features:

Here Factor1 and Factor2 are the feature names;
src_cnt1 and src_cnt2 - the numbers of unique values of the original features;
mod_cnt1 and mod_cnt2 - the numbers of unique feature values after categorization;
chi2 - the chi-square statistic;
chi2max - the threshold value of the chi-square statistic for a significance level of 0.95;
chi2Coeff - the ratio of the chi-square statistic to the threshold value;
corr - the correlation coefficient.

It can be seen that the pairs of features (X,T), (Y,T) and (Z,T) turned out to be independent (chi2Coeff < 1), which is logical, since the variable T was generated randomly. The variables X and Z are dependent, but less so than the linearly dependent X and Y, which is also logical.
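For readers who prefer Python, here is a sketch of the whole pipeline for one pair of features (the author's implementation is in c#; all names below are illustrative, and categorize repeats the helper shown earlier):

import numpy as np
from scipy.stats import chi2

def categorize(v, coeff=1.0):
    v = np.asarray(v, dtype=float)
    return np.rint((v - v.mean()) / v.std() * coeff).astype(int)

def chi2_ratio(a, b, coeff=1.0, quantile=0.95):
    ca, cb = categorize(a, coeff), categorize(b, coeff)
    rows, cols = np.unique(ca), np.unique(cb)
    # contingency matrix of joint class counts
    table = np.array([[np.sum((ca == r) & (cb == c)) for c in cols]
                      for r in rows], dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n  # n*Pi*Pj
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (len(rows) - 1) * (len(cols) - 1)
    return stat / chi2.ppf(quantile, dof)   # chi2Coeff: < 1 -> independent

rng = np.random.default_rng(0)
x = rng.integers(-100, 101, 3000)
print(chi2_ratio(x, x * x))                          # well above 1: dependent
print(chi2_ratio(x, rng.integers(-100, 101, 3000)))  # typically below 1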

I posted the code of the utility that calculates these indicators on github, together with the data.csv file. The utility accepts a csv file as input and calculates the dependencies between all pairs of columns: PtProject.Dependency.exe data.csv