In this exercise, we will focus on the research article by Theodore J. Eismeier, “Public Preferences About Government Spending: Partisan, Social, and Attitudinal Sources of Policy Differences,” Political Behavior, Vol. 4, No. 2, 1982, pp. 133-145.
The data for this problem are in GovernmentSpending.Sav, which contains the data recoded to meet the analytic requirements of the article. In addition to the variables included in the discriminant analyses, the data set includes the variables used to create the tables of descriptive statistics.
Stage 1: Define the Research Problem
In this stage, the following issues are addressed:
- Relationship to be analyzed
- Specifying the dependent and independent variables
- Method for including independent variables
Relationship to be analyzed
The purpose of this study is to identify the partisan, socioeconomic, and attitudinal factors that are associated with support for government spending in various sectors.
Specifying the dependent and independent variables
The article incorporates seven dependent variables which represent various areas of government spending. Each dependent variable is the target of a separate analysis:
- NATSPAC “Space exploration program”
- NATENVIR “Improving & protecting environment”
- NATHEAL “Improving & protecting nation’s health”
- NATEDUC “Improving nation’s education system”
- NATFARE “Welfare”
- NATARMS “Military, armaments, and defense”
- NATAID “Foreign aid”
Each dependent variable is a nonmetric variable that has three categories: 1 for “Spending too little”, 2 for “Spending about right”, and 3 for “Spending too much”.
In this exercise, we will use NATHEAL, “Improving & protecting nation’s health,” as the dependent variable.
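For reference, this coding corresponds to value labels like the following. This is a sketch only; the labels are already defined in GovernmentSpending.Sav:

VALUE LABELS natheal
  1 'Spending too little'
  2 'Spending about right'
  3 'Spending too much'.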
The independent variables are age, education, race, income, Democratic party identification, Republican party identification, confidence in government, personal financial situation worsening, and belief that federal taxes are too high.
The independent variables race, democrat, republican, worsening financial situation, and taxes too high are nonmetric and have already been converted to dummy-coded variables where necessary.
The independent variables age, education, income, and confidence in government (a scaled variable) will be treated as metric variables.
Method for including independent variables
Since the author’s interest lies in the role that these different factors play in attitudes toward government spending, we want to see the results for all of the independent variables, so we use direct entry of all variables as our method of selection.
Stage 2: Develop the Analysis Plan: Sample Size Issues
In this stage, the following issues are addressed:
- Missing data analysis
- Minimum sample size requirement: 20+ cases per independent variable
- Division of the sample: 20+ cases in each dependent variable group
Missing data analysis
In the missing data analysis, we are looking for a pattern or process in the missing data that could influence the results of the statistical analysis.
Run the MissingDataCheck Script
Complete the ‘Check for Missing Data’ Dialog Box
Number of Valid and Missing Cases per Variable
Three variables have over 100 missing cases: total family income, confidence in government, and spending on space exploration. However, because of the large sample size, all variables have valid data for 90% or more of cases, so no variables will be excluded for an excessive number of missing cases.
Frequency of Cases that are Missing Variables
Next, we examine the number of missing variables per case. Of the possible 16 variables in the missing data analysis (9 independent variables and 7 dependent variables), four cases were missing 8 or more variables, so we will exclude them from the analysis, reducing the sample size from 1468 to 1464.
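This screening step can be reproduced in syntax. A minimal sketch, assuming the metric predictors and the taxes dummy are named AGE, EDUC, INCOME, and TAXHIGH (the remaining names appear in the text above):

* Count the missing values per case across the 16 analysis variables.
COMPUTE nmissing = NMISS(age, educ, race, income, democrat, republic,
  confiden, finworse, taxhigh, natspac, natenvir, natheal, nateduc,
  natfare, natarms, nataid).
* Drop the cases missing half (8) or more of the variables.
SELECT IF (nmissing < 8).
EXECUTE.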
Number of Valid and Missing Cases after Removing Four Cases
After removing the four cases missing data for 50% or more of the variables, the number of valid cases for each variable is shown in the table below.
Distribution of Patterns of Missing Data
Correlation Matrix of Valid/Missing Dichotomous Variables
Inspection of the correlation matrix of valid/missing cases (not shown) reveals a single correlation close to the moderate range (0.394). This correlation is between two dependent variables, spending on military and spending on foreign aid, which will not be included in the same
analysis. All other correlations are in the weak or very weak range, so we can delete missing cases without fear that we are distorting the solution.
Minimum sample size requirement: 20+ cases per independent variable
The ratio of 1464 cases in the analysis to 9 independent variables is so large (163 to 1) that we will skip the more precise calculation taking into account the number of cases that will be missing in the analysis of each dependent variable.
Division of the sample: 20+ cases in each dependent variable group
To compute the number of cases in each dependent variable group, we run a frequency distribution for that dependent variable. In the output, we see that the minimum group size is 112 cases, so we meet this requirement.
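A minimal command for this check, using the dependent variable for this exercise:

FREQUENCIES VARIABLES=natheal.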
Stage 2: Develop the Analysis Plan: Measurement Issues
In this stage, the following issues are addressed:
- Incorporating nonmetric data with dummy variables
- Representing curvilinear effects with polynomials
- Representing interaction or moderator effects
Incorporating Nonmetric Data with Dummy Variables
Dummy coding for all nonmetric variables was completed when the data set was created.
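For illustration only, dummy coding in SPSS follows a pattern like the one below. The source variable PARTYID and its codes are hypothetical; this is not the actual recode used to build the data set:

* Hypothetical sketch of dummy coding (PARTYID and its codes assumed).
RECODE partyid (0 thru 2=1) (ELSE=0) INTO democrat.
EXECUTE.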
Representing Curvilinear Effects with Polynomials
We do not have any evidence of curvilinear effects at this point in the analysis.
Representing Interaction or Moderator Effects
We do not have any evidence at this point in the analysis that we should add interaction or moderator variables.
Stage 3: Evaluate Underlying Assumptions
In this stage, the following issues are addressed:
- Nonmetric dependent variable and metric or dummy-coded independent variables
- Multivariate normality of metric independent variables: assess normality of individual variables
- Linear relationships among variables
- Assumption of equal dispersion for dependent variable groups
Nonmetric dependent variable and metric or dummy-coded independent variables
The dependent variable is nonmetric. All of the independent variables are metric or dichotomous dummy-coded variables.
Multivariate normality of metric independent variables
Since there is no direct method for assessing multivariate normality, we assess the normality of the individual metric variables.
Run the ‘NormalityAssumptionAndTransformations’ Script
Complete the ‘Test for Assumption of Normality’ Dialog Box
Tests of Normality
We find that all of the independent variables fail the test of normality, and that none of the transformations induced normality in any variable. We should note the failure to meet the normality assumption for possible inclusion in our discussion of findings.
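Normality tests like those produced by the script can also be requested directly. A sketch, assuming the names AGE, EDUC, and INCOME for the first three metric predictors (CONFIDEN is the name used elsewhere in this exercise):

EXAMINE VARIABLES=age educ income confiden
  /PLOT NPPLOT
  /STATISTICS NONE.

The /PLOT NPPLOT subcommand produces the Tests of Normality table; checking transformations would require computing each transformed variable with COMPUTE and re-running the tests.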
Linear relationships among variables
Since our dependent variable is not metric, we cannot use it to test for linearity of the independent variables. As an alternative, we can plot each metric independent variable against all other independent variables in a scatterplot matrix to look for patterns of nonlinear relationships. If one of the independent variables shows multiple nonlinear relationships to the other independent variables, we consider it a candidate for transformation.
Requesting a Scatterplot Matrix
Specifications for the Scatterplot Matrix
The Scatterplot Matrix
Blue fit lines were added to the scatterplot matrix to improve interpretability.
None of the scatterplots show evidence of any nonlinear relationships.
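As an alternative to the menu steps above, the matrix can be produced in syntax. A sketch, again assuming the names AGE, EDUC, and INCOME:

GRAPH
  /SCATTERPLOT(MATRIX)=age educ income confiden.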
Assumption of equal dispersion for dependent variable groups
Box’s M tests for homogeneity of dispersion matrices across the subgroups of the dependent variable. The null hypothesis is that the dispersion matrices are homogeneous. If the analysis fails this test, we can request classification using separate group dispersion matrices in the classification phase of the discriminant analysis to see if this improves our accuracy rate.
Box’s M test is produced by the SPSS discriminant procedure, so we will defer this question until we have obtained the discriminant analysis output.
Stage 4: Estimation of Discriminant Functions and Overall Fit: The Discriminant Functions
In this stage, the following issues are addressed:
- Compute the discriminant analysis
- Overall significance of the discriminant function(s)
Compute the discriminant analysis
The steps to obtain a discriminant analysis are detailed on the following screens.
Requesting a Discriminant Analysis
Specifying the Dependent Variable
Specifying the Independent Variables
Specifying Statistics to Include in the Output
Specifying the Direct Entry Method for Selecting Variables
Specifying the Classification Options
Complete the Discriminant Analysis Request
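The menu steps above correspond approximately to syntax like the following. This is a sketch, not the exact specification: the names AGE, EDUC, INCOME, and TAXHIGH are assumptions, and priors are set from group sizes, consistent with the Prior Probabilities for Groups table discussed below:

DISCRIMINANT
  /GROUPS=natheal(1,3)
  /VARIABLES=age educ race income democrat republic confiden finworse taxhigh
  /PRIORS SIZE
  /STATISTICS=MEAN UNIVF BOXM TABLE CROSSVALID
  /PLOT=COMBINED
  /CLASSIFY=NONMISSING POOLED.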
Overall significance of the discriminant function(s)
The output to determine the overall statistical significance of the discriminant functions is shown below. As we can see in the Wilks’ Lambda table, SPSS reports two statistically significant functions, with probabilities less than 0.05. Based on the Wilks’ Lambda tests, we would conclude that there is a statistically significant relationship between the independent and dependent variables, and there are two statistically significant discriminant functions.
The canonical correlation values of .276 for the first function and .130 for the second function match the values of .28 and .13 in the article.
Our conclusion from this output is that there are two statistically significant discriminant functions for this problem.
Stage 4: Estimation of Discriminant Functions and Overall Fit: Assessing Model Fit
In this stage, the following issues are addressed:
- Assumption of equal dispersion for dependent variable groups
- Classification accuracy by chance criteria
- Press’s Q statistic
- Presence of outliers
Assumption of equal dispersion for dependent variable groups
In discriminant analysis, the best measure of overall fit is classification accuracy. The appropriateness of using the pooled covariance matrix in the classification phase is evaluated by the Box’s M statistic.
For this problem, Box’s M statistic is statistically significant, so we conclude that the dispersion of our three groups is not homogeneous.
Since we failed this test, we will re-run the analysis using separate covariance matrices in classification and see if this improves our overall accuracy rate. We will compare the results of the modified analysis to the cross-validated accuracy rate of 57.2% obtained for this analysis, as shown in the following table.
Re-running the Discriminant Analysis using separate-groups covariance matrices
Requesting classification using separate-groups covariance matrices
Results of classification using separate-groups covariance matrices
Classification using separate covariance matrices improves the accuracy of the model from 57.2% to 57.6%, an improvement of about 0.7% (0.4/57.2). This is well below the usual 10% improvement criterion that we require to justify a model with greater complexity and additional interpretive burden. We will revert to the model using pooled, or within-groups, covariance matrices for the classification phase of the analysis, since we do not gain anything from the separate covariance model.
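In syntax, the re-run differs from the sketch above only in the /CLASSIFY subcommand:

  /CLASSIFY=NONMISSING SEPARATE.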
Classification accuracy by chance criteria
The Classification Results table is shown again for the model using pooled covariance matrices for classification.
As shown below, the classification accuracy for our analysis is 57.2% (using the cross-validated accuracy rate). The accuracy rate is not reported in the article.
Using hand calculations, I computed the proportional by chance accuracy rate to be 0.459 (.581^2 + .339^2 + .080^2), using the proportion of cases reported in each group in the table of Prior Probabilities for Groups. A 25% increase over chance results in a benchmark of 0.574. I would interpret the cross-validated classification accuracy rate, 57.2%, as essentially satisfying the proportional by chance accuracy criterion.
Since one of our groups, “Too Little,” makes up 58% of our sample in a three-group problem, it is appropriate to apply the maximum chance criterion, which I compute to be 0.726 (1.25 × .581). We do not meet this criterion, so we should be cautious in generalizing the results of this model.
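These hand calculations can be reproduced with COMPUTE statements; a sketch using the group proportions from the Prior Probabilities for Groups table:

* Proportional chance = 0.459; its 1.25 benchmark = 0.574.
COMPUTE propchnc = .581**2 + .339**2 + .080**2.
COMPUTE propcrit = 1.25 * propchnc.
* Maximum chance benchmark = 1.25 times the largest group = 0.726.
COMPUTE maxcrit = 1.25 * .581.
EXECUTE.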
We should also note that much of the model’s accuracy was accomplished by predicting a large proportion of each group to be members of the dominant “Too Little” group. None of the “Too Much” cases were predicted accurately and only 12% of the “About Right” group was accurately predicted. Although we found two statistically significant discriminant functions separating the three groups, in fact, they were only achieving slight differentiation of the ‘Too little’ and ‘About right’ groups.
Press’s Q statistic
Substituting the parameters for this problem into the formula for Press’s Q, we obtain Q = [1173 − (675 × 3)]² / [1173 × (3 − 1)] = 309.4, which exceeds the critical value of 6.63 (chi-square with one degree of freedom at the 0.01 level). According to this statistic, our prediction accuracy is greater than expected by chance. However, as the text notes on page 205, this test is sensitive to sample size in much the same way that chi-square values are. While the statistic is significant, the significance is a consequence of sample size as much as of effect size, as the limited predictive accuracy suggests.
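The statistic can also be computed directly; a sketch with N = 1173 cases, n = 675 correct classifications, and K = 3 groups:

* Press's Q = (N - n*K)**2 / (N*(K - 1)) = 309.4 for these values.
COMPUTE pressq = (1173 - 675*3)**2 / (1173*(3 - 1)).
EXECUTE.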
Presence of outliers
SPSS prints Mahalanobis distance scores for each case in the table of Casewise Statistics, so we can use this as a basis for detecting outliers. The cutoff for identifying an outlier is the critical value of chi-square at the 0.01 level of significance, with degrees of freedom equal to the number of independent variables (9).
We can request this figure from SPSS using the following compute command:
COMPUTE mahcutpt = IDF.CHISQ(0.99,9).
EXECUTE.
Here, 0.99 is the cumulative probability up to the significance level of interest and 9 is the number of degrees of freedom. SPSS will create a column of values in the data set that contains the desired value.
We scan the table of Casewise Statistics to identify any cases that have a Squared Mahalanobis distance greater than 21.666 for the group to which the case is most likely to belong, i.e. under the column labeled ‘Highest Group.’
Scanning the Casewise Statistics for the original sample (not shown), I do not find any cases with a D² value this large for the highest classification group. The largest value I found was 16.165.
Stage 5: Interpret the Results
In this section, we address the following issues:
- Number of functions to be interpreted
- Relationship of functions to categories of the dependent variable
- Assessing the contribution of predictor variables
- Impact of multicollinearity on solution
Number of functions to be interpreted
As indicated previously, there are two significant discriminant functions to be interpreted.
Role of functions in differentiating categories of the dependent variable
The combined-groups scatterplot enables us to link the discriminant functions to the categories of the dependent variable. I have modified the SPSS output by changing the symbols for the different points so that we can detect the group members on a black and white page. In addition, I have added reference lines at the zero value for each axis.
Analyzing this plot, we see that the first function differentiates the ‘Too Little’ group from ‘About Right’ and ‘Too Much’ groups. The second function differentiates the ‘Too Much’ group from the ‘About Right’ group.
Assessing the contribution of predictor variables
Identifying the statistically significant predictor variables
When we do direct entry of all the independent variables, we do not get a statistical test of the significance of the contribution of each individual variable. While we could run the stepwise procedure to obtain these tests, we can also look to the structure matrix for an indication of the importance of variables.
Importance of Variables and the Structure Matrix
The Structure Matrix is a matrix of the correlations between the individual predictors and the discriminant functions. These correlations are also referred to as the discriminant loadings and can be interpreted like factor loadings in assessing the relative contribution of each independent variable to the discriminant function. While there is not a consensus on the size of loading required for interpretation, Tabachnick and Fidell state “By convention, correlations in excess of 0.33 (10% of variance) may be considered eligible while lower ones are not.” (page 540).
Following this guideline, we would identify three variables as important to the first discriminant function: REPUBLIC ‘Republican party identification’, DEMOCRAT ‘Democratic party identification’, and RACE ‘Race of respondent’.
We would identify CONFIDEN ‘Confidence in government’ and FINWORSE ‘Personal financial condition is getting worse’ as important to the second discriminant function.
Note that our purpose in examining the structure matrix in discriminant analysis is not the same as our purpose in factor analysis, so we do not have the concern with simple structure in identifying important predictor variables that we had with factor analysis. A variable can play multiple roles in the discriminant functions, e.g. higher than average scores can be associated with membership in one group, while lower than average scores can be associated with membership in another group.
Comparing Group Means to Determine Direction of Relationships
We can examine the pattern of means on the significant independent variables for the three groups of the dependent variable to identify the role of the independent variables in predicting group membership. The following table contains an extract of the SPSS output of the group statistics.
The first discriminant function distinguishes the groups ‘Too Little’ from ‘About Right’ and ‘Too Much’. Persons choosing the ‘Too Little’ response were more likely to be Black (12% versus 5% of the About Right group and 2% of the ‘Too Much’ group). Similarly, they were more likely to identify with the Democratic party (60% versus 46% and 22%) and less likely to identify with the Republican party (24% versus 36% and 65%).
The second discriminant function distinguishes the ‘About Right’ group from the ‘Too Little’ and ‘Too Much’ groups. The ‘About Right’ group had a higher average score on confidence in government (2.76 versus 2.55 and 2.15) and had a lower proportion of respondents who thought that their personal financial situation was getting worse (20% versus 25% and 27%).
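The group statistics extracted for this table can be reproduced with a means command over the variables identified in the structure matrix:

MEANS TABLES=race democrat republic confiden finworse BY natheal.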
Impact of Multicollinearity on solution
In SPSS discriminant analysis, multicollinearity is indicated by very small tolerance values for variables, e.g. less than 0.10 (0.10 is the size of the tolerance, not its significance value).
When we request direct entry of all independent variables, SPSS does not print out tolerance values as it does for stepwise entry. However, if a variable is collinear with another independent variable, SPSS will not enter it into the discriminant functions, but will instead print out a table in the output like the following:
To force SPSS to print this table, I created a duplicate variable named DUP that had the same values as another independent variable.
Since we do not find a table like this in our output, we can conclude that multicollinearity is not a problem in this analysis.
Stage 6: Validate The Model
In this stage, we are normally concerned with the following issues:
- Conducting the Validation Analysis
- Generalizability of the Discriminant Model
Conducting the Validation Analysis
To validate the discriminant analysis, we can randomly divide our sample into two groups, a screening sample and a validation sample. The analysis is computed for the screening sample and used to predict membership on the dependent variable in the validation sample. If the model is valid, we would expect that the accuracy rates for both groups would be about the same.
In the double cross-validation strategy, we reverse the designation of the screening and validation samples and re-run the analysis. We can then compare the discriminant functions derived for both samples. If the two sets of functions contain very different sets of variables, it indicates that the variables might have achieved significance because of the sample size and not because of the strength of the relationship. Our finding about these individual variables would be that their predictive utility is not generalizable.
Set the Starting Point for Random Number Generation
Compute the Variable to Randomly Split the Sample into Two Halves
Specify the Cases to Include in the First Screening Sample
Specify the Value of the Selection Variable for the First Validation Analysis
Specify the Value of the Selection Variable for the Second Validation Analysis
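A sketch of these steps in syntax: the seed value here is arbitrary (any fixed value makes the split reproducible), and the names AGE, EDUC, INCOME, and TAXHIGH are assumptions:

SET SEED = 20000.
COMPUTE split = RV.BERNOULLI(0.5).
EXECUTE.
* First validation run: estimate on split = 0, classify the other half.
DISCRIMINANT
  /GROUPS=natheal(1,3)
  /VARIABLES=age educ race income democrat republic confiden finworse taxhigh
  /SELECT=split(0)
  /PRIORS SIZE
  /STATISTICS=TABLE CROSSVALID.
* Second validation run: repeat the command with /SELECT=split(1).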
Generalizability of the Discriminant Model
We base our decisions about the generalizability of the discriminant model on a table that compares key outputs from the analysis of the full data set with each of the validation runs.
| | Full Model | Split=0 | Split=1 |
| --- | --- | --- | --- |
| Number of Significant Functions | 2 | 1 | 1 |
| Cross-validated Accuracy for Selected Cases | 57.2% | 57.0% | 57.8% |
| Accuracy Rate for Unselected (Validation) Cases | | 58.2% | 56.5% |
| Important Variables from Structure Matrix | REPUBLIC, DEMOCRAT, RACE, CONFIDEN, FINWORSE | SPSS output is for two functions, … | SPSS output is for two functions, … |
The accuracy rate of the model is maintained throughout the validation analysis, suggesting that this is a correct assessment of our ability to predict attitude toward health spending.
However, a second discriminant function was not found in either of the validation analyses, suggesting that the significance of the second function in the model of the full data set was associated with the larger sample size in that analysis.
The failure to validate the second function implies that the data support only the existence of the first discriminant function, which differentiates the group that feels we are spending ‘Too Little’ on health care from the other respondents. In addition, the accuracy of the model is based on its classification of most cases into the ‘Too Little’ group. The accuracy rate of the discriminant model could be achieved by chance alone.
In sum, our ability to differentiate preferences for health spending is limited by the fact that most respondents to the survey believe we should spend more on health care. The independent variables available in this analysis are not good discriminators of different preferences for health spending.