Lab 20

Chi-square test

 

Cross tabulation and the Pearson Chi-Square Test

    Suppose you have noticed that many more psychology majors are women than men. It could be that more women are simply enrolled in the university, so you would expect more female than male psych majors. Or it could be that something about the psychology major attracts women (or repels men?).

    Both major and gender are categorical (i.e., nominal) variables, and in this case we're interested in whether there is a relationship between them. These two things put us at the bottom left of the Decision Tree diagram:



    (1) We're looking for a relationship, and (2) we have categorical (nominal, not interval/ratio) data. If we look at the bottom of the chart, these two things lead us to the Chi-Square test.

    One part of this test is a crosstabulation. Crosstabulation is a statistical technique used to display a breakdown of the data by these two variables (that is, it is a table that displays the frequency of different majors broken down by gender).
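    As a rough illustration of what a crosstabulation does, the sketch below counts how often each combination of two categorical variables occurs. The six records are hypothetical toy data made up for this example; they are not real enrollment figures.

```python
from collections import Counter

# Hypothetical toy records (gender, major); made up purely for illustration.
records = [
    ("female", "psychology"),
    ("female", "psychology"),
    ("male", "psychology"),
    ("female", "biology"),
    ("male", "biology"),
    ("male", "biology"),
]

# Count how many records fall into each (gender, major) combination.
counts = Counter(records)

genders = sorted({g for g, _ in records})
majors = sorted({m for _, m in records})

# Build the table: one row per gender, one column per major.
crosstab = {g: {m: counts[(g, m)] for m in majors} for g in genders}

print("majors:", majors)
for g in genders:
    print(g, [crosstab[g][m] for m in majors])
```

    Each cell of the resulting table is an observed frequency, which is what the chi-square test works from.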

    The Pearson chi-square test of independence essentially tells us whether the results of a crosstabulation are statistically significant. That is, are the two categorical variables independent of (unrelated to) one another? So basically, the chi-square test is a kind of correlation test for categorical variables.

    • A chi-square will be significant if the residuals (the differences between observed frequencies and expected frequencies) for one level of a variable differ as a function of another variable.
    • The chi-square value by itself does not tell us the nature of the differences.
    So for our example, the chi-square test will tell us whether there are more female psychology majors than you would expect by chance (based on total number of males and females and total number of people in different majors).

    The Chi-Square Formula
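
    Written out from the steps used in the worked example later in this lab (find each residual, square it, divide by the expected frequency, and sum over all cells), the statistic can be expressed as follows. Here fo is the observed frequency and fe is the expected frequency of a cell.

```latex
\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}
```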

    When do we use these methods?
    • When we have categorical variables
      • Do the percentages match up with how we thought they would?
      • Are two (or more) categorical variables independent?

    Hypothesis Testing with Chi-square

    • We test the null hypothesis that nothing interesting is happening (i.e., there is no relationship) versus the alternative hypothesis that the findings are interesting (i.e., there is a relationship).
    • We reject the null hypothesis only if results at least as extreme as ours would occur with a probability of .05 or lower when the null hypothesis is true (that is, by chance alone).
    • Hypothesis tests tell us how plausible it is that our findings are due to chance.

      Example

      A manufacturer of watches takes a sample of 200 people. Each person is classified by age and watch type preference (digital vs. analog). The question: is there a relationship between age and watch preference?

      Set up our data in a "cross tabulation" of our two variables. The data are observed frequencies (fo).



                      Watch preference
                      digital   analog   undecided
      Age  under 30      90       40        10
           over 30       10       40        10

      Step 1: State the hypotheses and select an alpha level

        H0: In the population, preference is independent of (NOT related to) age
        Ha: In the population, preference is related to age
        We'll set α = 0.05
      Step 2: Find the degrees of freedom and the critical value
      • Compute your degrees of freedom
          df = (#Columns - 1) * (#Rows - 1) = (3 - 1) * (2 - 1) = 2
      • Go to a chi-square table and find the critical value
          For this example, with df = 2 and α = 0.05, the critical chi-square value is 5.99
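
      The table lookup can be cross-checked with a short sketch. It relies on a shortcut that holds only when df = 2 (the upper-tail probability of chi-square is then exp(-x/2)), so it is not a general replacement for the table. The variable names are illustrative.

```python
import math

n_rows, n_cols = 2, 3   # two age groups, three watch preferences
alpha = 0.05

df = (n_cols - 1) * (n_rows - 1)

# Assumption: this shortcut is valid ONLY for df = 2, where
# P(chi-square > x) = exp(-x / 2). Other df need a table or software.
assert df == 2
critical_value = -2 * math.log(alpha)

print(df, round(critical_value, 2))
```

      For this example the result agrees with the tabled critical value of 5.99.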
      Step 3: Collect your data and compute your test statistic
        Part 1: Obtain row and column totals, also called the marginals (shown at the end of each row and in the bottom row).



                        Watch preference
                        digital   analog   undecided   Total
        Age  under 30      90       40        10        140
             over 30       10       40        10         60
             Total        100       80        20        200

        Part 2: Compute the expected frequencies

        For each cell, fe = (column total * row total) / N

        For people under 30

        • preferring digital watches: fe = (100*140)/200 = 70
        • preferring analog watches: fe = (80*140)/200 = 56
        • undecided: fe = (20*140)/200 = 14

        For people over 30

        • preferring digital watches: fe = (100*60)/200 = 30
        • preferring analog watches: fe = (80*60)/200 = 24
        • undecided: fe = (20*60)/200 = 6
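
        The same expected frequencies can be computed for every cell at once. This is a minimal sketch using the observed counts from the watch example; the variable names are illustrative.

```python
observed = [
    [90, 40, 10],   # under 30: digital, analog, undecided
    [10, 40, 10],   # over 30:  digital, analog, undecided
]

# Marginals: row totals, column totals, and the grand total N.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency for each cell: (row total * column total) / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print(row)
```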

      So let's enter the predicted (expected) values (in parentheses) into our crosstabulation.

                      Watch preference
                      digital    analog     undecided   Total
      Age  under 30   90 (70)    40 (56)    10 (14)      140
           over 30    10 (30)    40 (24)    10 (6)        60
           Total      100        80         20           200

      Part 3: Compute the Chi-squared statistic

      • Find the residuals (fo - fe) for each cell
      • Square these differences
      • Divide the squared differences by fe
      • Sum the results

        Adding up the six cells: χ2 = 5.71 + 4.57 + 1.14 + 13.33 + 10.67 + 2.67 = 38.09
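
        The four steps above can be sketched as a short loop over the cells. The observed and expected counts are the ones from the watch example; the variable names are illustrative.

```python
observed = [[90, 40, 10], [10, 40, 10]]
expected = [[70, 56, 14], [30, 24, 6]]

chi_square = 0.0
for obs_row, exp_row in zip(observed, expected):
    for fo, fe in zip(obs_row, exp_row):
        residual = fo - fe                # step 1: residual
        chi_square += residual ** 2 / fe  # steps 2-4: square, divide by fe, sum

print(round(chi_square, 2))
```

        Without intermediate rounding the sum is about 38.10, which matches the 38.09 used in this lab up to rounding.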

      Step 4: Compare the computed statistic (38.09) against the critical value (5.99) and make a decision about your hypotheses
        df = (rows - 1) * (columns - 1) = (2 - 1) * (3 - 1) = 1 * 2 = 2
        The Excel function CHIINV gives us the critical value of χ2.
        χ2 critical = CHIINV(α, df) = CHIINV(.05, 2) = 5.99
        The Excel function CHIDIST gives us the p-value of χ2.
        p = CHIDIST(χ2 obtained, df) = CHIDIST(38.09, 2) = 0.0000000054
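
        As a cross-check on the p-value, here is a minimal sketch that uses the same df = 2 shortcut (upper-tail probability equals exp(-x/2)). It is not a general substitute for CHIDIST, which handles any df.

```python
import math

chi_square = 38.09   # obtained statistic from the watch example
df = 2
alpha = 0.05

# Assumption: exp(-x / 2) is the upper-tail probability ONLY when df = 2.
assert df == 2
p_value = math.exp(-chi_square / 2)

print(f"{p_value:.10f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```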

      • Here we reject H0 (38.09 is greater than 5.99, and p is less than .05) and conclude that there is a relationship between age and watch preference.





    Computing Crosstabs and Chi-squared in SPSS

      Choose Analyze, Descriptive Statistics, Crosstabs
      Select your categorical variables
        put one in Row and the other in Column

      Click on the Statistics button and then check the chi-square option.


      Expected Counts

        Expected counts are based on marginal percentages
        Multiply the marginal percentages together to get the expected percentage for that cell, then multiply by N to get expected counts
        Or, have SPSS compute them -- Choose Cells, then check Expected.
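
        The marginal-percentage method gives the same result as the (row total * column total) / N rule used in the hand calculation. Writing R for a row total, C for a column total, and N for the grand total (symbols introduced here for illustration):

```latex
f_e = \frac{R}{N} \cdot \frac{C}{N} \cdot N = \frac{R \, C}{N},
\qquad \text{e.g.} \quad
\frac{140}{200} \cdot \frac{100}{200} \cdot 200 = 0.70 \times 0.50 \times 200 = 70
```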

      Residuals

        Difference between observed and expected counts (fo - fe)
        Choose Cells, then check Unstandardized in the Residuals box.
        Standardized residuals behave approximately like z-scores (each residual is divided by an estimate of its standard deviation, the square root of the expected count)
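
        To make the two kinds of residuals concrete, the sketch below computes both for the watch example. It assumes that the standardized residual is the Pearson form, (fo - fe) / sqrt(fe); check your SPSS output against it rather than taking this as the definitive calculation.

```python
import math

observed = [[90, 40, 10], [10, 40, 10]]
expected = [[70, 56, 14], [30, 24, 6]]

# Unstandardized residuals: observed minus expected.
residuals = [
    [fo - fe for fo, fe in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# Assumed (Pearson) standardized residuals: residual / sqrt(expected).
standardized = [
    [(fo - fe) / math.sqrt(fe) for fo, fe in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for row in standardized:
    print([round(z, 2) for z in row])
```

        Under this assumption, squaring the standardized residuals and adding them up reproduces the chi-square statistic, so the cells with the largest absolute values contribute most to a significant result.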


      Output:


      Here is some sample output looking at a crosstab of final grade and review session attendance from the students.sav file.
      • Crosstab shows frequencies of one variable for each level of the other
      • Count refers to the observed frequencies (from the data)
      • Expected count refers to the expected frequencies
      Output shows the Pearson chi-square and "Asymp. Sig." (the significance level, i.e., the p-value) for the crosstab above.
      If "Asymp. Sig." is less than .05, then the residuals differ as a function of the independent variable
      • So here the chi-square is not significant (sig is greater than α = 0.05), so we fail to reject H0. This means that we do not reject the hypothesis that final grade and review session attendance are independent (in other words, we have no evidence of a relationship between the two variables).

    For some of the questions in the lab, you will need this data file students.sav.

    Lab 24 Worksheet
    Email to your GA when finished.
    Use any extra time to complete your homework and your project.
    Every computer lab on campus has SPSS. Here is a complete list of them.